Machine-Learning-03

TensorFlow, Keras, Deep Learning

Imron Rosyadi

Overview: Recognizing Handwritten Digits

Today, we’ll dive into building a neural network to recognize handwritten digits.

We’ll achieve ~99% accuracy using fewer than 100 lines of Python/Keras code.

This is a classic problem in Machine Learning, often tackled with the MNIST dataset.

What You’ll Learn

This session will cover key concepts and practical techniques:

  • What a neural network is and how it learns.
  • Building basic 1-layer neural networks with tf.keras.
  • Adding more layers for improved performance.
  • Implementing learning rate schedules.
  • Introduction to Convolutional Neural Networks (CNNs).
  • Regularization techniques: Dropout and Batch Normalization.
  • Understanding and mitigating overfitting.

Understanding the Training Data: MNIST

The MNIST dataset contains 60,000 labeled training images of handwritten digits (0-9), plus 10,000 labeled images held out for evaluation.

Each image is associated with its correct numerical label.

This “labeled dataset” is crucial for training.

Our neural network learns to classify these images into 10 classes (0 through 9).

Example MNIST Digits:

Training vs. Validation Datasets

How do we assess our model’s “real-world” performance?

  • Training Dataset: Used to update the model’s internal parameters. The model sees this data multiple times.
  • Validation Dataset: A separate labeled dataset that the model never trains on. Evaluating on it shows how well the model generalizes to new data, rather than how well it has memorized the training set.

Important

Using “unseen” data for validation is fundamental for robust model evaluation.

Monitoring Training Progress

During training, we track two key metrics: accuracy and loss.

Accuracy (Right Plot):

  • Percentage of correctly recognized digits.
  • Should increase as training progresses.

Loss (Left Plot):

  • Measures how “badly” the model performs.
  • The goal is to minimize this value.
  • Should decrease on both training and validation data.

X-axis: Epochs (iterations over entire dataset)

Making Predictions

After training, the model can predict digits it hasn’t seen.

This initial model reaches ~90% validation accuracy, meaning it still misclassifies about 1,000 of the 10,000 validation digits.

Caution

Even 90% accuracy leaves room for improvement, especially in critical ECE applications like medical imaging or autonomous systems.

Understanding Tensors: The Language of Data

In deep learning, data is represented as tensors. Tensors are multi-dimensional arrays, analogous to vectors and matrices.

  • Grayscale Image (28x28 pixels): A 2D tensor (matrix) with shape [28, 28].

  • Color Image (28x28 pixels, RGB): A 3D tensor with shape [28, 28, 3]. (Height, Width, Color Channels)

  • Batch of Color Images (e.g., 128 images): A 4D tensor with shape [128, 28, 28, 3]. (Batch Size, Height, Width, Color Channels)

Note

The list of dimensions is called the “shape” of the tensor.

Understanding tensor shapes is crucial for building and debugging neural networks.
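The three shapes listed above can be checked directly. This is a minimal sketch using NumPy arrays filled with zeros as stand-ins for real images; the variable names are illustrative only.

```python
import numpy as np

# One grayscale 28x28 image: a 2D tensor (matrix).
gray_image = np.zeros((28, 28), dtype=np.float32)

# One RGB 28x28 image: height, width, color channels.
color_image = np.zeros((28, 28, 3), dtype=np.float32)

# A batch of 128 RGB images: batch size, height, width, color channels.
batch = np.zeros((128, 28, 28, 3), dtype=np.float32)

for name, tensor in [("gray_image", gray_image),
                     ("color_image", color_image),
                     ("batch", batch)]:
    # ndim is the number of dimensions (rank); shape lists their sizes.
    print(f"{name}: rank {tensor.ndim}, shape {tensor.shape}")
```

Printing shapes like this at each stage of a pipeline is usually the quickest way to locate a shape mismatch.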

Interactive Example: Image Compression Analogy

Let’s visualize how much information we retain when we reduce the “dimensions” of an image. This is analogous to how neural networks extract features.

Introduction to Neural Networks

Neural Networks are powerful computational models inspired by the human brain. They are used to learn complex patterns from data.

For ECE, neural networks are crucial in:

  • Signal Processing: Noise reduction, feature extraction.
  • Image Recognition: Object detection, medical imaging analysis.
  • Control Systems: Adaptive control, robotics.

The Keras Sequential API

When building neural networks with TensorFlow and Keras, the Sequential API is a straightforward way to stack layers. This is ideal for models where layers have exactly one input tensor and one output tensor.

Example: Image Classifier using Dense Layers

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28, 1]), # Flattens input images
    tf.keras.layers.Dense(200, activation="relu"),    # Hidden layer with ReLU
    tf.keras.layers.Dense(60, activation="relu"),     # Another hidden layer
    tf.keras.layers.Dense(10, activation='softmax')   # Output layer for 10 classes
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Train the model
# model.fit(dataset, ...)

What is a Neuron? The Basic Building Block

The fundamental unit of a neural network is the neuron, a concept analogous to a processing unit in digital circuits.

Each neuron performs three main operations:

  1. Weighted Sum: Multiplies each input by a corresponding weight and sums them up.
  2. Bias Addition: Adds a bias constant to the weighted sum.
  3. Activation: Passes the result through a non-linear activation function.

The weights (W) and biases (b) are the parameters learned during training. Initially, they are random and get adjusted to minimize error.
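The three operations can be sketched in a few lines of plain Python. The input values, weights, and bias below are made up for illustration, and sigmoid is assumed as the activation; any non-linear function could take its place.

```python
import math

def neuron(inputs, weights, bias):
    """Compute one neuron's output: weighted sum, bias, then activation."""
    # 1. Weighted sum: multiply each input by its weight and add them up.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # 2. Bias addition.
    pre_activation = weighted_sum + bias
    # 3. Non-linear activation (sigmoid, chosen here as an assumption).
    return 1.0 / (1.0 + math.exp(-pre_activation))

inputs = [0.5, -1.0, 2.0]     # hypothetical input values
weights = [0.4, 0.3, -0.2]    # hypothetical weights
bias = 0.1                    # hypothetical bias

output = neuron(inputs, weights, bias)
print(f"neuron output: {output:.4f}")
```

Training would adjust `weights` and `bias`; the structure of the computation stays the same.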

(Figure: diagram of a neuron)

Tip

Think of weights as variable resistors and biases as constant voltage offsets in an analog circuit. The activation function is like a threshold detector or a non-linear amplifier.

Single Dense Layer: MNIST Example

Let’s consider classifying handwritten digits from the MNIST dataset. Each image is 28x28 pixels grayscale.

The simplest neural network for this task uses 784 pixels (28x28) as inputs to a single dense layer.

This layer has 10 output neurons, one for each digit class (0-9).

Each of these 10 output neurons takes all 784 pixel values as input, performs a weighted sum, adds a bias, and applies an activation.

(Diagram: each of the 784 input pixels, Pixel 1 through Pixel 784, connects to each of the 10 output neurons, Digit 0 through Digit 9.)

Matrix Multiplication for a Single Layer

A dense layer’s operations can be efficiently represented using matrix multiplication.

If X is a matrix of 100 images (each flattened to 784 pixels), and W is the weight matrix (784 inputs x 10 outputs), then:

\[ \text{Weighted Sums} = X \cdot W \]

\[ \text{Output} = \text{Activation}(X \cdot W + b) \]

Where b is the bias vector (10 elements), broadcasted across the 100 images.
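A short NumPy sketch makes the shapes and the bias broadcast concrete. The pixel values and weights are random placeholders, not trained parameters, and the activation step is left out so that only the linear part is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.random((100, 784))                  # 100 flattened images (placeholder data)
W = rng.normal(0.0, 0.01, size=(784, 10))   # weight matrix: 784 inputs x 10 outputs
b = np.zeros(10)                            # one bias per output neuron

weighted_sums = X @ W        # shape (100, 10)
logits = weighted_sums + b   # b is broadcast across all 100 rows

print("X:", X.shape, "W:", W.shape, "b:", b.shape)
print("X @ W + b:", logits.shape)
```

Each row of the result holds the 10 pre-activation values for one image.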

(Animation: matrix multiplication for a single layer)

In Keras, this is simplified: tf.keras.layers.Dense(10, activation='softmax')

Going Deep: Chaining Layers

“Deep learning” refers to using multiple hidden layers. Each layer computes weighted sums of the outputs of the previous layer.

This architecture allows the network to learn progressively more complex and abstract features from the raw input data.

For example, early layers might detect edges or simple shapes, while later layers combine these to recognize parts of objects or entire objects.

The choice of activation function is critical. Hidden layers typically share one activation, and only the very last layer of a classifier uses a different one (softmax).


Activation Functions: ReLU and Softmax

Activation functions introduce non-linearity, allowing neural networks to learn complex, non-linear relationships in data.

Sigmoid

  • The classical activation function.
  • Historically used on intermediate (hidden) layers.

Rectified Linear Unit (ReLU)

\(f(x) = \max(0, x)\)

  • Most popular activation for hidden layers.
  • Simple and computationally efficient.
  • Helps prevent vanishing gradients.

Softmax

\[ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \]

  • Used in the output layer of multi-class classifiers.
  • Converts logits into probabilities that sum to 1.

(Figures: plots of the sigmoid and ReLU functions)

Softmax in Action: Interactive Example

Adjust the Logit Value for a single class and observe how Softmax normalizes probabilities. Here, we simulate 10 classes, with one Logit Value adjusted at a time.
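For readers without the interactive widget, the same experiment can be approximated in plain Python. This is a sketch: the logit values are arbitrary, and subtracting the maximum logit before exponentiating is a common numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

# Ten classes with equal logits give a uniform distribution.
logits = [0.0] * 10
uniform = softmax(logits)

# Raising a single logit pulls probability toward that class.
logits[3] = 2.0
peaked = softmax(logits)

print("uniform:", [round(p, 3) for p in uniform])
print("peaked: ", [round(p, 3) for p in peaked])
```

Raising one logit increases that class's probability and lowers every other class's probability, because the outputs must still sum to 1.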

Loss Function: Cross-Entropy

To train a neural network, we need to measure how “wrong” its predictions are compared to the true labels. This measure is called the loss function.

For multi-class classification, Cross-Entropy Loss is the standard.

\[ H(p, q) = - \sum_{i=1}^{K} p_i \log(q_i) \]

Where:

  • \(p_i\) is the true probability for class \(i\) (1 for the correct class, 0 otherwise).
  • \(q_i\) is the predicted probability for class \(i\) (output of softmax).
  • \(K\) is the number of classes.

Note

Cross-entropy loss heavily penalizes incorrect high-confidence predictions, guiding the network to both be correct and confident.
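A small worked example shows how the penalty grows. The three predicted distributions below are invented for illustration; the true class is the third one.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * log(q_i)); terms with p_i = 0 contribute nothing."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

true_label = [0, 0, 1]                 # one-hot: class 2 is correct

confident_right = [0.05, 0.05, 0.90]   # hypothetical predictions
unsure = [0.30, 0.30, 0.40]
confident_wrong = [0.90, 0.05, 0.05]

loss_right = cross_entropy(true_label, confident_right)
loss_unsure = cross_entropy(true_label, unsure)
loss_wrong = cross_entropy(true_label, confident_wrong)

print(f"confident and right: {loss_right:.3f}")
print(f"unsure:              {loss_unsure:.3f}")
print(f"confident and wrong: {loss_wrong:.3f}")
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the correct class, so it grows without bound as that probability approaches 0.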

(Figure: cross-entropy loss illustration)

Training: Gradient Descent

“Training” a neural network means iteratively adjusting its weights and biases to minimize the loss function. This is achieved using an optimization algorithm called Gradient Descent.

  1. Compute Gradient: Calculate the partial derivatives of the loss function with respect to every weight and bias. This “gradient” vector points in the direction of the steepest increase of the loss.
  2. Update Parameters: Adjust weights and biases in the opposite direction of the gradient, typically by a small step size called the learning rate.

\[ \begin{aligned} W_{new} &= W_{old} - \alpha \frac{\partial L}{\partial W} \\ b_{new} &= b_{old} - \alpha \frac{\partial L}{\partial b} \end{aligned} \]

This process is repeated over many epochs (passes through the entire dataset).
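The update rule can be tried on a toy problem. The sketch below minimizes a made-up one-parameter loss, L(w) = (w - 3)^2, whose gradient is known in closed form; a real network computes gradients for every weight with backpropagation instead.

```python
def loss(w):
    return (w - 3.0) ** 2        # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)       # dL/dw, derived by hand

alpha = 0.1                      # learning rate
w = 0.0                          # arbitrary starting point
history = [loss(w)]

for step in range(50):
    # Step against the gradient, scaled by the learning rate.
    w = w - alpha * grad(w)
    history.append(loss(w))

print(f"final w: {w:.5f}, final loss: {history[-1]:.2e}")
```

Each step shrinks the distance to the minimum by a constant factor here; with a learning rate above 1.0 the same loop would diverge on this particular loss.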

(Figure: gradient descent illustration)

Mini-Batching and Momentum

To improve training efficiency and stability:

Mini-Batching

  • Instead of computing the gradient for one image at a time, we use a batch (e.g., 32, 64, or 128 images).
  • Provides a more stable and representative gradient estimate.
  • Leverages highly optimized matrix operations on GPUs/TPUs.

Momentum

  • Adds a fraction of the previous update vector to the current update.
  • Helps the optimizer move past shallow local minima and saddle points, and speeds up convergence along directions where the gradient stays consistent.
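The momentum update can be compared with plain gradient descent on a toy loss. This is a sketch under assumed values: the loss L(w) = 0.01 * w^2 is deliberately shallow, and the learning rate 0.1 and momentum coefficient 0.9 are illustrative choices.

```python
def grad(w):
    return 0.02 * w              # gradient of the shallow toy loss 0.01 * w**2

alpha = 0.1                      # learning rate (illustrative)
mu = 0.9                         # momentum coefficient (illustrative)

w_plain = 10.0                   # plain gradient descent
w_mom, velocity = 10.0, 0.0      # gradient descent with momentum

for _ in range(100):
    w_plain -= alpha * grad(w_plain)

    # Keep a fraction of the previous update and add the new gradient step.
    velocity = mu * velocity - alpha * grad(w_mom)
    w_mom += velocity

print(f"plain GD after 100 steps:      w = {w_plain:.3f}")
print(f"momentum after 100 steps:      w = {w_mom:.3f}")
```

On this shallow slope the accumulated velocity lets the momentum variant travel much farther toward the minimum at 0 in the same number of steps.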


Illustration: A saddle point, where the gradient is zero but not a true minimum in all directions.

Glossary of Key Terms

  • Batch/Mini-batch: A subset of the training data used in one iteration of gradient descent.
  • Cross-Entropy Loss: A common loss function for classification tasks, measuring dissimilarity between predicted and true probability distributions.
  • Dense Layer: A layer where each neuron is connected to every neuron in the preceding layer.
  • Features: The input attributes or data points fed into a neural network.
  • Labels: The correct outputs or target values in supervised learning.
  • Learning Rate: A hyperparameter controlling the step size during gradient descent.
  • Logits: The raw, unnormalized outputs of a neural network layer before being passed through an activation function like softmax.
  • Loss Function: A function that quantifies the error between predicted outputs and true labels.
  • Neuron: The fundamental computational unit of a neural network.
  • One-Hot Encoding: A categorical variable representation where each category is a binary vector (e.g., [0,0,1,0] for class 3 of 4).
  • ReLU (Rectified Linear Unit): A popular activation function, \(f(x) = \max(0, x)\).
  • Sigmoid: An S-shaped activation function, \(f(x) = 1 / (1 + e^{-x})\).
  • Softmax: An activation function that converts a vector of numbers into a probability distribution.
  • Tensor: A generalization of vectors and matrices to an arbitrary number of dimensions.

Diving into the Code (Part 1)

Now, let’s dissect the code from the keras_01_mnist.ipynb notebook. Understanding each section is key to building and modifying models.

Note

Follow along in the Colab notebook if you can!

We’ll cover core components:

  • Model Parameters and Imports
  • Data Preparation with tf.data.Dataset
  • Building a Keras Sequential Model
  • Training and Validation
  • Visualizing Predictions

Model Parameters and Imports

These initial cells set up the environment and define global constants.

Parameters Cell:

Sets values for:

  • BATCH_SIZE: Number of samples processed per gradient update.
  • EPOCHS: Number of complete passes through the training dataset.
  • GCS_PATTERN: Location of MNIST data files on Google Cloud Storage.

Imports Cell:

Imports necessary libraries:

  • tensorflow (tf): Core Deep Learning framework.
  • numpy (np): For numerical operations (especially tensor manipulation).
  • matplotlib.pyplot (plt): For plotting and visualization.
# Example of Parameters
BATCH_SIZE = 64
EPOCHS = 5
GCS_PATTERN = "gs://cloud-tpu-datasets/mnist/mnist_{}.tfrec"

print(f"Batch Size: {BATCH_SIZE}")
print(f"Epochs: {EPOCHS}")
# Example of Imports
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

print("TensorFlow version:", tf.__version__)
print("NumPy version:", np.__version__)

Data Preparation with tf.data.Dataset

The tf.data.Dataset API is powerful for building efficient data pipelines. It handles loading, parsing, and preprocessing data, especially at scale.

Key Steps:

  1. Load Fixed-Length Records: Images and labels are stored in tfrec files. We decode raw byte strings into images (float32, normalized 0-1) and flatten them.

    imagedataset = tf.data.FixedLengthRecordDataset(image_filename, 28*28, header_bytes=16)
  2. read_image Function: Parses byte strings into float32, normalizes pixels (0-1), and reshapes to [28*28] (flattened for initial dense layer).

    def read_image(tf_bytestring):
        image = tf.io.decode_raw(tf_bytestring, tf.uint8)
        image = tf.cast(image, tf.float32)/256.0
        image = tf.reshape(image, [28*28])
        return image
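To see what read_image does to the bytes without running TensorFlow, here is a NumPy analogue. It is not the notebook's code: `read_image_np` and `fake_record` are hypothetical names, and the record is synthetic rather than real MNIST data.

```python
import numpy as np

def read_image_np(bytestring):
    """NumPy analogue of read_image: bytes -> float32 vector of 784 values."""
    image = np.frombuffer(bytestring, dtype=np.uint8)   # one byte per pixel
    image = image.astype(np.float32) / 256.0            # scale to [0, 1)
    return image.reshape(28 * 28)                       # flattened, 784 values

# A synthetic 784-byte record: three known pixel values, the rest zero.
fake_record = bytes([0, 128, 255] + [0] * (28 * 28 - 3))
image = read_image_np(fake_record)

print(image.shape, image.dtype)
print(image[:3])
```

Dividing by 256.0 maps the brightest pixel (255) to just under 1.0, so every value lies in [0, 1).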

Data Preparation (Cont.): Pipeline Operations

After parsing, we apply various transformations to optimize the dataset for training:

1. Mapping & Zipping:

  • Apply read_image to all images using .map().
  • Do similar steps for labels.
  • Combine images and labels using .zip(): dataset = tf.data.Dataset.zip((imagedataset, labelsdataset))

2. Optimizations:

  • .cache(): Store dataset in RAM for faster epoch transitions (for small datasets).
  • .shuffle(buffer_size): Randomize training order to prevent batch order biases.
  • .repeat(): Loop the dataset indefinitely for multiple epochs.
  • .batch(batch_size): Group samples into mini-batches for efficient processing.
  • .prefetch(tf.data.experimental.AUTOTUNE): Overlap data preprocessing and model execution to keep the GPU busy. (Recent TensorFlow releases also expose this constant as tf.data.AUTOTUNE.)

Building a Keras Sequential Model: The 1-Layer Network

Our first model is a simple, single-layer dense neural network.

Model Definition:

  • We use tf.keras.Sequential for a linear stack of layers.
  • tf.keras.layers.Input(shape=(28*28,)): Defines the input shape (784-element flattened vector).
  • tf.keras.layers.Dense(10, activation='softmax'):
    • 10 neurons: One for each digit class (0-9).
    • 'softmax' activation: Outputs a probability distribution over the 10 classes, summing to 1. (The highest probability indicates the predicted class.)

(Diagram: Input Layer (Shape: 784) -> Dense Layer (10 neurons, Softmax) -> Output (10 probabilities))

Compiling the Model

Before training, the model needs to be compiled. Compilation configures key aspects of the training process.

model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
  • optimizer='sgd' (Stochastic Gradient Descent): The algorithm used to update the model’s weights based on the loss. SGD is a foundational optimizer for neural networks.
  • loss='categorical_crossentropy': The loss function measures the discrepancy between predicted and true class probabilities. Categorical crossentropy is standard for multi-class classification when labels are one-hot encoded.
  • metrics=['accuracy']: Additional metrics to monitor during training and evaluation. 'accuracy' measures the percentage of correct predictions.
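Since categorical crossentropy expects one-hot labels, it helps to see what that encoding looks like. The sketch below uses NumPy and three made-up digit labels; the notebook's own pipeline may build the encoding differently.

```python
import numpy as np

labels = np.array([3, 0, 9])                       # hypothetical digit labels

# Row i of the 10x10 identity matrix is the one-hot vector for class i.
one_hot = np.eye(10, dtype=np.float32)[labels]

print(one_hot)
```

Each row contains a single 1 at the index of the true class, which matches the shape of the 10-value softmax output.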

Model Summary & Training Utility

After compilation, we can inspect the model’s architecture.

model.summary():

Prints a detailed overview of the model:

  • Layers (type, output shape).
  • Number of trainable parameters in each layer.
  • Total parameters in the model.

This is invaluable for debugging and understanding model complexity.

PlotTraining Callback:

A custom utility (from the notebook) to visualize training curves dynamically. It shows loss and accuracy for both training and validation sets in real-time.

import tensorflow as tf
# Define a simple model for demonstration
model_summary_demo = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28*28,)), # For MNIST, inputs are 784-element vectors
    tf.keras.layers.Dense(10, activation='softmax') # 10 output classes
])

# Simulate compile for summary to show expected parameters
model_summary_demo.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model_summary_demo.summary()

Training and Validation

The model.fit() function is where the actual learning takes place.

model.fit(training_dataset,
          steps_per_epoch=steps_per_epoch,
          epochs=EPOCHS,
          validation_data=validation_dataset,
          validation_steps=1,
          callbacks=[plot_training])
  • training_dataset: The preprocessed dataset used for learning.
  • steps_per_epoch: Number of batches per epoch (total training samples / batch size).
  • epochs: Total number of times the model iterates over the entire training data.
  • validation_data: The unseen dataset used to evaluate performance after each epoch.
  • validation_steps: Number of batches from validation_data to run per validation round.
  • callbacks: List of custom functions executed at various stages of training (e.g., plot_training).
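As a quick check of the steps_per_epoch arithmetic, here is the calculation with the 60,000 MNIST training images and the BATCH_SIZE of 64 from the parameters cell. The name `TRAINING_SAMPLES` is introduced here for illustration, and the notebook may compute the value differently.

```python
BATCH_SIZE = 64
TRAINING_SAMPLES = 60_000   # size of the MNIST training set

# Integer division: only full batches are counted toward one epoch.
steps_per_epoch = TRAINING_SAMPLES // BATCH_SIZE
leftover = TRAINING_SAMPLES - steps_per_epoch * BATCH_SIZE

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples not covered by full batches: {leftover}")
```

Because the training dataset is shuffled and repeated indefinitely, the few leftover samples are simply picked up in later epochs.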

Visualizing Predictions

After training, we use model.predict() to evaluate the model on new inputs.

probabilities = model.predict(font_digits, steps=1)
predicted_labels = np.argmax(probabilities, axis=1)
  • model.predict(input_data): Generates output predictions for the input samples. For a classification model with softmax activation, it returns a 2D array where each row is a probability distribution over the classes for one input. (e.g., [[0.01, 0.05, ..., 0.90, ..., 0.02], ...])
  • np.argmax(probabilities, axis=1): Converts the probability distributions into a single predicted class label.
    • np.argmax(): Returns the index of the maximum value.
    • axis=1: Specifies to find the maximum along the “class” dimension (i.e., for each image, find the class with the highest probability).
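The role of the axis argument is easiest to see on a tiny array. The probabilities below are invented (two inputs, three classes) rather than real model output.

```python
import numpy as np

# Each row is one input; each column is one class.
probabilities = np.array([
    [0.05, 0.80, 0.15],
    [0.60, 0.30, 0.10],
])

# axis=1: for each row (input), the index of the most probable class.
predicted_labels = np.argmax(probabilities, axis=1)

# axis=0: for each column (class), the index of the input that scores highest.
# This is rarely what you want for predictions; shown for contrast.
best_input_per_class = np.argmax(probabilities, axis=0)

print("axis=1:", predicted_labels)
print("axis=0:", best_input_per_class)
```

The result for axis=1 has one entry per input, which is the shape needed to compare predictions against the true labels.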


Note

This simple 1-layer model already achieves ~90% accuracy! But we can do much better.

Adding Layers: Going Deeper

To improve our model’s accuracy beyond 90%, we need to add more layers. This allows the network to learn more complex, hierarchical features.

The Concept of Depth

  • A deeper network can model non-linear relationships more effectively.
  • Each hidden layer learns increasingly abstract representations of the input data.

Activation Functions Revisited

While softmax is for the output layer of a classifier, hidden layers need different activation functions.

Sigmoid Activation Function

For intermediate (hidden) layers, the sigmoid function is a classical choice:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

  • Output Range: Maps any input to a value between 0 and 1.
  • Interpretation: Can be seen as a “soft” switch, where values close to 0 or 1 indicate strong decisions.
  • Historical Significance: Widely used in early neural networks.

Plot of the Sigmoid Function

Designing a Deeper Model

Let’s expand our simple model by adding two hidden Dense layers with sigmoid activation.

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(28*28,)),        # Input Layer
        tf.keras.layers.Dense(200, activation='sigmoid'), # Hidden Layer 1
        tf.keras.layers.Dense(60, activation='sigmoid'),  # Hidden Layer 2
        tf.keras.layers.Dense(10, activation='softmax')   # Output Layer
    ]
)
# (Compilation and training code follows)
  • Hidden Layer 1: 200 neurons with sigmoid activation.
  • Hidden Layer 2: 60 neurons with sigmoid activation.
  • Output Layer: Remains 10 neurons with softmax for classification.

(Diagram: Input Layer (Shape: 784) -> Dense Layer 1 (200 neurons, Sigmoid) -> Dense Layer 2 (60 neurons, Sigmoid) -> Output Layer (10 neurons, Softmax))

Note

Notice the increase in the number of parameters with multiple layers. Run this model in the Colab notebook.
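The parameter counts can be worked out by hand: a dense layer has one weight per input-output pair plus one bias per output neuron. The helper name `dense_params` is made up for this sketch; model.summary() should report the same totals.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

# 1-layer model: 784 -> 10
one_layer_total = dense_params(784, 10)

# Deeper model: 784 -> 200 -> 60 -> 10
deep_layers = [dense_params(784, 200),
               dense_params(200, 60),
               dense_params(60, 10)]
deep_total = sum(deep_layers)

print(f"1-layer model: {one_layer_total:,} parameters")
print(f"deeper model:  {deep_layers} -> {deep_total:,} parameters")
```

Most of the added parameters sit in the first hidden layer, because it connects all 784 pixels to 200 neurons.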

Unexpected Behavior: What Happened?

Despite adding layers and parameters, the model doesn’t always improve as expected.

  • High Loss: The training loss and validation loss are extremely high.
  • Low Accuracy: Accuracy barely increases above random guessing (around 10%).

Warning

More parameters don’t automatically mean better performance. Deeper networks introduce new challenges!

Why Did the Deeper Model Fail? The Vanishing Gradient Problem

The sigmoid activation function can hinder learning in deep networks.

The Problem:

  • The gradient (derivative) of the sigmoid function is very small for inputs far from 0.
  • In a deep network, these small gradients are multiplied together during backpropagation.
  • This causes gradients to “vanish” as they propagate back to earlier layers.

Consequence for ECE:

  • Early layers’ weights are hardly updated.
  • The network struggles to learn useful features from the input.
  • Training stalls, leading to poor performance.

Important

This is a common issue with traditional activation functions like sigmoid and tanh in deep architectures.
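The size of the problem can be estimated with a few lines of Python. This sketch looks only at the sigmoid derivative itself; in a real network the weights also enter the product, so the numbers below are an illustration rather than an exact gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative of the sigmoid

max_grad = sigmoid_grad(0.0)      # the derivative peaks at x = 0
far_grad = sigmoid_grad(5.0)      # far from 0 it is much smaller

# Best-case product of sigmoid derivatives across n stacked layers.
best_case = {n: max_grad ** n for n in (1, 2, 5, 10)}

print(f"sigmoid'(0) = {max_grad}, sigmoid'(5) = {far_grad:.5f}")
for n, value in best_case.items():
    print(f"{n:2d} layers: at most {value:.2e}")
```

Even in the best case the derivative is 0.25, so each additional sigmoid layer scales this factor down by at least 4x.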

Special Care for Deep Networks

The “AI winter” of the 80s and 90s was partly due to the challenges of training deep networks. Modern deep learning thrives due to “dirty tricks” that ensure convergence.

Overcoming Deep Network Challenges

  • ReLU Activation: A simple yet powerful non-linearity.
  • Better Optimizers: Algorithms that navigate complex loss landscapes.
  • Careful Initialization: Setting initial weights to facilitate learning.
  • Numerical Stability: Ensuring calculations don’t crash.

ReLU Activation: The Modern Choice

The sigmoid function’s vanishing gradients made it problematic for deep networks. The Rectified Linear Unit (ReLU) is the de facto standard activation for hidden layers today.

\[ \text{ReLU}(x) = \max(0, x) \]

  • Simplicity: Returns x for positive inputs, 0 for negative inputs.
  • Gradient: Has a constant gradient of 1 for positive inputs (and 0 for negative inputs).
  • Benefits for ECE:
    • Mitigates vanishing gradient problem.
    • Speeds up convergence.
    • Computationally much cheaper than sigmoid/tanh.

Plot of the ReLU Function

Note

Replace activation='sigmoid' with activation='relu' in hidden layers. The output layer retains softmax for classification.

Better Optimizers: Beyond SGD

Stochastic Gradient Descent (SGD) can get stuck in “saddle points” in high-dimensional spaces.

Modern optimizers are more robust and efficient.

Saddle Points:

  • Points in the loss landscape where the gradient is zero, but it’s not a true minimum.
  • SGD can get stuck here, preventing further learning.

Adaptive Optimizers:

  • Use concepts like “momentum” and “adaptive learning rates” for each parameter.
  • Help the model “sail past” saddle points and converge faster.
  • Examples: Adam, RMSprop, Adagrad.

Keras Implementation:

Update the optimizer in model.compile:

model.compile(optimizer='adam', # Use Adam optimizer
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Tip

Adam is widely considered a good default choice for most deep learning tasks.

Weight Initialization & Numerical Stability

Two critical, often hidden, factors for stable deep network training.

1. Random Initializations:

  • How the network’s weights and biases are set before training begins.
  • Poor initialization can lead to slow convergence or vanishing/exploding gradients.
  • Keras Default: Uses 'glorot_uniform' (also known as Xavier uniform).
    • Designed to keep activation values and gradients roughly in the same scale across layers.
    • No action needed: Keras applies this initializer to Dense layers by default, and it works well for this model.

2. Numerical Stability (NaNs):

  • Categorical crossentropy involves log(). If the input to log is 0, the result is -infinity, which then propagates through the loss and gradients as inf or NaN (Not a Number).
  • softmax output (probabilities) can be numerically 0 in float32 despite being mathematically non-zero.
  • Keras Solution: tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    • Computes softmax and crossentropy together in a numerically stable way. With from_logits=True, the last layer must output raw logits (no softmax activation).
    • No action needed here: when softmax is the last activation and categorical_crossentropy is the loss, Keras already guards this computation against log(0).
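The underflow is easy to reproduce in NumPy. This sketch illustrates the general log-sum-exp idea with made-up logits; it is not Keras's internal implementation.

```python
import numpy as np

# Two classes with very different float32 logits; class 0 is the true class.
logits = np.array([0.0, 200.0], dtype=np.float32)
true_label = np.array([1.0, 0.0], dtype=np.float32)

# Naive route: softmax first, then log.
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()                  # exp(-200) underflows to 0 in float32
with np.errstate(divide="ignore", invalid="ignore"):
    naive_loss = -np.sum(true_label * np.log(probs))

# Stable route: compute log-probabilities directly from the logits.
shifted = logits - logits.max()
log_probs = shifted - np.log(np.sum(np.exp(shifted)))
stable_loss = -np.sum(true_label * log_probs)

print("softmax output:", probs)
print("naive loss:    ", naive_loss)
print("stable loss:   ", stable_loss)
```

The naive route loses the information as soon as the probability rounds to 0, while working from the logits keeps the loss finite.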

Success So Far: ~97% Accuracy!

With ReLU activation and the Adam optimizer, our deeper model should now converge effectively.

You should observe:

  • Training and validation loss decreasing steadily.
  • Training and validation accuracy climbing to around 97%.

This marks a significant improvement over the initial 90% and the failed deep sigmoid model.

We’re approaching our goal of “significantly above 99% accuracy!”

(Example of ~97% accuracy training curves)

Tip

If you’re stuck, refer to keras_02_mnist_dense.ipynb in the Colab repo.

Learning Rate Decay: Fine-Tuning Convergence

Training too fast can lead to noisy convergence or even divergence. A learning rate decay schedule starts fast and slows down over time.

The Problem with High Learning Rates:

  • Training curves become noisy.
  • Validation metrics jump erratically.
  • Model might jump over optimal solutions or oscillate.

The Solution:

  • Start with a higher learning rate to explore the loss landscape quickly.
  • Gradually decrease the learning rate as training progresses, allowing for finer adjustments and more stable convergence.
  • Often exponential decay: \(LR = LR_0 \times \text{decay_rate}^{\text{epoch}}\)

Implementing Learning Rate Decay with Keras

Keras makes it easy to add a learning rate scheduler using a callback.

1. Define the Decay Function: A Python function that calculates the learning rate for a given epoch.

import math
def lr_decay(epoch):
    return 0.01 * math.pow(0.6, epoch) # Exponential decay

This function starts at 0.01 and multiplies the learning rate by 0.6 each epoch (a 40% reduction per epoch).
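Evaluating the decay function for the first few epochs shows the schedule it produces. The function is repeated here so the snippet runs on its own; each epoch's rate is 0.6 times the previous one.

```python
import math

def lr_decay(epoch):
    return 0.01 * math.pow(0.6, epoch)   # exponential decay

schedule = [lr_decay(epoch) for epoch in range(5)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: learning rate {lr:.6f}")
```

Keras passes a 0-based epoch index to the scheduler, so the first epoch trains at the full 0.01.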

2. Create the Callback:

lr_decay_callback = tf.keras.callbacks.LearningRateScheduler(lr_decay, verbose=True)

verbose=True prints the learning rate at the start of each epoch.

3. Add to model.fit():

Include the lr_decay_callback in the list of callbacks.

model.fit(..., callbacks=[plot_training, lr_decay_callback])


Important

The lr_decay_callback must be added to the callbacks list for it to take effect.

Impact of Learning Rate Decay: Cleaner Convergence

The effect of learning rate decay is often dramatic.

  • Noise Reduction: Training curves become much smoother.
  • Stable Validation: Validation accuracy and loss fluctuate less, showing cleaner convergence.
  • Improved Accuracy: Sustained validation accuracy can now be observed above 98%.

This technique is crucial for pushing models to higher performance ceilings.

(Training curves with learning rate decay)

Overfitting and Dropout

Even with a deeper network, ReLU, Adam, and learning rate decay, we often hit a wall around 98% accuracy. This is frequently due to overfitting.

The Signs of Overfitting

  • Validation loss increases: While training loss continues to decrease.
  • Validation accuracy plateaus or drops: While training accuracy keeps improving.

This means the model is learning details specific to the training data, but failing to generalize to new, unseen data (like our validation set).

Example: Validation loss rising while training loss drops

Dropout: A Regularization Technique

Dropout is a powerful and widely used technique to combat overfitting.

How it Works:

  • During each training iteration, a random subset of neurons (and their connections) is temporarily “dropped out” (set to zero).
  • This means the network cannot rely on any single neuron to be present.
  • It forces the network to learn more robust and redundant features.

Analogy:

Like training multiple smaller, “thinner” networks simultaneously, combined into one.

Keras Implementation:

Add tf.keras.layers.Dropout to your model:

model = tf.keras.Sequential([
    # ... previous layers
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dropout(0.2), # Dropout layer
    tf.keras.layers.Dense(60, activation='relu'),
    tf.keras.layers.Dropout(0.2), # Another Dropout layer
    tf.keras.layers.Dense(10, activation='softmax')
])

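The 0.2 is the dropout rate: during training, each output of the preceding layer is set to zero with probability 0.2, so about 20% are dropped at every step. Dropout is inactive at inference time.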
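What a dropout layer does to its input during training can be sketched in NumPy. This is an illustration of the "inverted dropout" scheme, in which the surviving values are scaled up by 1 / (1 - rate); to my understanding this matches the behavior of tf.keras.layers.Dropout, but the code below is not Keras's implementation, and the activations are placeholder ones.

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 0.2                                   # fraction of values to drop

# Placeholder activations: a batch of 1000 samples, 200 neurons each.
activations = np.ones((1000, 200), dtype=np.float32)

# Keep each value with probability 1 - rate; a new mask is drawn every step.
keep_mask = rng.random(activations.shape) >= rate

# Zero the dropped values and rescale the rest so the mean stays the same.
dropped = activations * keep_mask / (1.0 - rate)

print(f"fraction dropped: {1.0 - keep_mask.mean():.3f}")
print(f"mean before: {activations.mean():.3f}, mean after: {dropped.mean():.3f}")
```

The rescaling keeps the expected input to the next layer unchanged, which is why no adjustment is needed when dropout is switched off at inference time.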

Dropout’s Initial Impact: A Complex Picture

Applying dropout can lead to a mixed initial response.

(Training curves with Dropout)

  • Increased Noise: The training curves show more fluctuation due to the random dropping of neurons.
  • Higher Overall Loss: Both training and validation loss might be higher than without dropout.
  • Slight Accuracy Drop: Validation accuracy might initially decrease.

Note

This doesn’t mean dropout failed; it indicates the model is being forced to learn differently.

We are pushing it to generalize better, not just memorize.

Deeper Roots of Overfitting: The Nature of the Problem

Overfitting isn’t always easily fixed by dropout alone; it stems from fundamental issues.

1. “Too Many Degrees of Freedom”:

  • If a network is too large for the complexity of the data, it can simply “memorize” training examples.
  • It fails to extract underlying patterns, resulting in poor generalization.

Analogy for ECE:

  • Imagine fitting a 10th-order polynomial to only three data points. It will perfectly hit those points but be wild everywhere else.

2. Insufficient Training Data:

  • Neural networks are data-hungry.
  • With too little data, even a reasonably sized network can overfit because there isn’t enough variety to learn robust patterns.

3. Inadequate Network Architecture:

  • Sometimes, the chosen network type isn’t suitable for the data’s structure.
  • Our current network, built only from Dense (fully connected) layers, struggles with the spatial relationships in images.

Introduction to Convolutional Neural Networks (CNNs)

Our current model struggles because it treats image pixels as independent features, losing spatial context. Convolutional Neural Networks (CNNs) are designed to leverage this spatial information.

Key Idea:

  • Instead of fully-connected layers, CNNs use convolutional filters (kernels).
  • These filters slide across the input image, detecting local features like edges, corners, and textures.
  • They preserve the spatial relationships between pixels.

Benefits for ECE (Image Processing):

  • Feature Hierarchy: Learn increasingly complex features (edges -> shapes -> object parts).
  • Parameter Sharing: Detect the same feature anywhere in the image with the same filter.
  • Translation Invariance: Robust to slight shifts in object position.
  • Crucial for applications like object detection, medical imaging, and autonomous systems.
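The sliding-filter idea can be sketched with a hand-written loop in NumPy. The 6x6 image and the 3x3 vertical-edge kernel are made up for illustration, and `conv2d_valid` is a hypothetical helper; in a CNN the kernel values are learned, and Keras provides optimized layers for this operation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # The same 9 weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Kernel that responds where brightness increases from left to right.
kernel = np.array([[-1, 0, 1]] * 3, dtype=np.float32)

response = conv2d_valid(image, kernel)
print(response)
```

The output is non-zero only where the window straddles the edge, and it would respond the same way if the edge were shifted, because the same 9 weights are applied at every position.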

Important

We’ve hit a performance ceiling with our dense network because it fundamentally misunderstands image data. CNNs are the game-changer here!