Machine Learning

Image Classification: Data-driven Approach, k-Nearest Neighbor, train/val/test splits

Imron Rosyadi

Introduction to Nearest Neighbor Classifiers

What is Image Classification?

Given: A set of images, each labeled with a single category (e.g., “cat”, “dog”, “car”).

Goal: Train a model to predict the category of new, unseen images.

ECE Relevance:

Autonomous Systems: Object detection in self-driving cars.
Medical Imaging: Diagnosing diseases from X-rays or MRI scans.
Quality Control: Detecting defects in manufacturing.
Signal Processing: Classifying radar or sonar signals into object types.

Before we jump into Nearest Neighbors, let’s set the stage. What exactly is image classification? Imagine you have a library of images, each already tagged with what’s inside – cats, dogs, cars, etc. The challenge for us is to build an intelligent system that can look at a brand new image it has never seen before and correctly assign it to one of these predefined categories.

This isn’t just an abstract computer science problem; it has profound implications in ECE. Think about autonomous vehicles needing to instantly recognize pedestrians, traffic signs, or other cars to ensure safety. In medical imaging, machine learning helps doctors identify anomalies in scans. In manufacturing, automated systems can ensure product quality by detecting tiny defects. Even in signal processing, we classify patterns in sensor data, like identifying different types of aircraft from radar signals. All these applications rely on effective classification.

The Nearest Neighbor (NN) Principle

“Birds of a feather flock together.”

The core idea is simple:
- Store all training data.
- To classify a new point: Find the closest training example.
- Assign the new point the label of its closest training example.

Important

How do we define “closest”? We use Distance Metrics!

The Nearest Neighbor, or NN, classifier is beautifully simple. Its core principle is like the old saying: “birds of a feather flock together.” If you want to know what something is, look at what it’s most similar to from things you already know.

In practice, this means our classifier simply memorizes all the training examples. When a new, unknown data point comes in – let’s say a new image – the classifier compares it to every single training image it has. It then identifies which training image is the “closest” or “most similar” and simply assigns that training image’s label to our new, unknown image.

But this raises a critical question: how do we quantitatively define “closest” or “most similar”? This is where distance metrics come into play. They provide a mathematical way to measure the dissimilarity between two data points.
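The store-then-compare procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `nearest_neighbor_predict`, the toy 2-D points, and the use of a sum of absolute differences as the measure of closeness are all assumptions made for the example.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_new):
    """Return the label of the training example closest to x_new."""
    # Distance from x_new to every stored training example
    # (sum of absolute differences, chosen here only for illustration).
    distances = np.abs(X_train - x_new).sum(axis=1)
    # The prediction is simply the label of the closest example.
    return y_train[np.argmin(distances)]

# Toy data (hypothetical): two "cat" points and two "dog" points in 2-D.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array(["cat", "cat", "dog", "dog"])

prediction = nearest_neighbor_predict(X_train, y_train, np.array([2.0, 1.0]))
print(prediction)
```

Note that there is no fitting step at all: the "model" is just the pair of arrays `X_train` and `y_train`.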

Defining “Closeness”: Distance Metrics

Common choices for d(I_1, I_2) (distance between images I_1 and I_2):

L1 Distance (Manhattan Distance): \[ d_1(I_1, I_2) = \sum_{p} |I_1(p) - I_2(p)| \]
- Sums absolute differences across all pixel values.
- Represents the distance traveled when moving only along the coordinate axes (horizontally or vertically on a grid).
L2 Distance (Euclidean Distance): \[ d_2(I_1, I_2) = \sqrt{\sum_{p} (I_1(p) - I_2(p))^2} \]
- Sums squared differences, then takes the square root.
- Represents the straight-line distance in a multi-dimensional space.

Tip

In ECE, these distances are fundamental for comparing signals or data vectors.

When we talk about images, each image can be thought of as a high-dimensional vector, where each pixel value is a dimension. So, subtracting two images means subtracting their pixel values.

The L1 distance, also known as Manhattan or city-block distance, is like navigating a grid. Imagine you’re in a city with perfectly laid-out streets; to get from point A to point B, you sum the horizontal and vertical distances. For images, we sum the absolute differences of each corresponding pixel.

The L2 distance, or Euclidean distance, is what you typically think of as “distance” in everyday life – the shortest straight line between two points. For images, we square the differences of corresponding pixels, sum them, and then take the square root. This emphasizes larger differences more than L1.

Both L1 and L2 distances are widely used in ECE not just for images, but for comparing sound signals, sensor readings, or any vectorized data. Understanding these basic metrics is crucial for many signal processing and machine learning tasks.
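As a small worked example of the two formulas, the sketch below computes both distances for a pair of made-up 2x2 grayscale images. The pixel values are hypothetical, and the cast to a signed integer type is an implementation detail assumed here because 8-bit image arrays would otherwise wrap around on subtraction.

```python
import numpy as np

# Two hypothetical 2x2 grayscale "images" with 8-bit pixel values.
I1 = np.array([[10, 20], [30, 40]], dtype=np.uint8)
I2 = np.array([[12, 20], [30, 30]], dtype=np.uint8)

# Cast to a signed type first: subtracting uint8 arrays wraps around
# (10 - 12 would become 254 instead of -2).
diff = I1.astype(np.int64) - I2.astype(np.int64)

d1 = np.abs(diff).sum()          # L1: 2 + 0 + 0 + 10 = 12
d2 = np.sqrt((diff ** 2).sum())  # L2: sqrt(4 + 0 + 0 + 100) = sqrt(104)

print(d1, d2)
```

The single large difference of 10 accounts for 10 of the 12 units of L1 distance, but for 100 of the 104 units under the square root in L2, which is what "emphasizes larger differences" means in practice.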

k-Nearest Neighbors (k-NN) Classifier

Beyond a single neighbor

Instead of just one neighbor, k-NN considers the k closest training examples. The label of the new point is determined by a majority vote among its k nearest neighbors.

Key Hyperparameter: k

k = 1 is the basic Nearest Neighbor.
Choosing k > 1 provides a smoother decision boundary and can reduce sensitivity to noisy data points.

Note

A small k can be sensitive to noise. A large k can blur boundaries.

[Interactive demo: a slider selects k (1 to 15 in steps of 2, default 3) and the plot redraws the k-NN decision regions and the predicted class of the New Point.]

The concept of Nearest Neighbor is extended to k-Nearest Neighbors, or k-NN. Instead of just picking one closest example, k-NN looks at the k closest examples. The label for the new data point is then decided by a majority vote among these k neighbors.

The value of k is a crucial choice. If k is 1, it’s just the basic NN. If k is larger, like 3 or 5, the classifier tends to be more robust to noisy training data. A single outlier won’t sway the decision as much. However, if k is too large, it might smooth out the decision boundaries too much, potentially ignoring local patterns. This interactive plot on the right lets you play with the k value. Observe how increasing k changes the colored decision regions and even the classification of our ‘New Point’ by making the boundaries smoother. For ECE students, this directly relates to parameter tuning in filters or control systems, where a single parameter can dramatically alter system behavior.
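The majority vote, and the way a larger k dampens a single noisy example, can be sketched as follows. The function name `knn_predict`, the toy points, and the deliberately mislabeled point are assumptions made for illustration, and L2 distance is used only as one possible choice of metric.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean (L2) distance to every training example.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class 0 near the origin, class 1 near (5, 5),
# plus one "noisy" class-1 point sitting inside the class-0 cluster.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],
                    [0.6, 0.6]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

x_new = np.array([0.7, 0.7])
pred_k1 = knn_predict(X_train, y_train, x_new, k=1)  # follows the noisy point
pred_k3 = knn_predict(X_train, y_train, x_new, k=3)  # noisy point is outvoted
print(pred_k1, pred_k3)
```

With k = 1 the query inherits the label of the noisy point; with k = 3 the two clean class-0 neighbors outvote it. Odd values of k avoid tied votes in two-class problems.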

The Challenge: Choosing Hyperparameters

Hyperparameters are settings that control the learning process; they are chosen by the designer rather than learned from the data.

For k-NN, key hyperparameters include:

The value of k.
The distance metric (L1, L2, etc.).

Caution

Problem: If we choose k based on how well the model performs on the test data, we are essentially “cheating.” The model would seem to perform better than it would on truly unseen data. This is called overfitting to the test set.

So, we know we need to choose k, and we need to choose a distance metric. These are what we call hyperparameters. They are not learned directly from the training data, like the weights in a neural network; instead, they are set before the learning process begins and influence how the model learns.

The critical challenge here is figuring out the best values for these hyperparameters. It’s tempting to just try different k values and pick the one that gives the highest accuracy on our test dataset. However, this is a major pitfall in machine learning. If we do this, we are effectively using the test set to train our model – implicitly, by guiding our hyperparameter choices. This leads to an overly optimistic performance estimate. Our model would look great on that specific test set, but when deployed to truly new, unseen data, its performance would likely drop significantly. This is a form of overfitting to the test set, and it’s something we absolutely must avoid to build reliable ECE systems.
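The optimism that comes from selecting on the test set can be demonstrated with a small simulation. This is a hypothetical setup rather than a k-NN experiment: the labels are coin flips, so no candidate can truly exceed 50% accuracy, and each random "candidate" stands in for one hyperparameter setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: binary labels that are pure coin flips, so no
# classifier can truly do better than 50% accuracy.
n_test, n_fresh, n_candidates = 100, 10_000, 200
y_test = rng.integers(0, 2, size=n_test)
y_fresh = rng.integers(0, 2, size=n_fresh)

# Each "candidate" stands in for one hyperparameter setting; its
# predictions are random guesses on both sets.
test_preds = rng.integers(0, 2, size=(n_candidates, n_test))
fresh_preds = rng.integers(0, 2, size=(n_candidates, n_fresh))

test_acc = (test_preds == y_test).mean(axis=1)
best = int(np.argmax(test_acc))  # "tuning" on the test set

best_test_acc = test_acc[best]
best_fresh_acc = (fresh_preds[best] == y_fresh).mean()
print(f"accuracy on the test set used for selection: {best_test_acc:.2f}")
print(f"accuracy of the same candidate on fresh data: {best_fresh_acc:.2f}")
```

The winning candidate looks clearly better than chance on the set it was selected on, yet falls back to roughly 50% on data that played no part in the selection. The gap is the bias introduced purely by choosing the maximum.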

Data Splitting: The Holy Trinity

To correctly tune hyperparameters, we divide our data into three distinct sets:

Training Set (e.g., 60-80%):
- Used to fit the model (for k-NN, fitting just means storing the examples).
- The model learns from this data.
Validation Set (e.g., 10-20%):
- A “fake test set” used to tune hyperparameters.
- Gives a performance estimate for comparing hyperparameter settings during development, without touching the test set.
Test Set (e.g., 10-20%):
- Used for a single, final evaluation of the chosen model.
- Provides an unbiased estimate of generalization performance on truly unseen data.

Important

Golden Rule: The Test Set is used only once, at the very end!

To address the hyperparameter challenge correctly, we partition our available dataset into three distinct, non-overlapping subsets: the training set, the validation set, and the test set.

The training set is the largest portion. This is the data our k-NN model “sees” and “memorizes.” All of the examples it will ever use for classification will come from here.

The validation set acts as our crucial bridge during development. It’s like a practice test. We use it to evaluate different hyperparameter settings (like varying k). By seeing how our model performs on the validation set with different k values, we can pick the best k without ever touching our final exam data. This gives us an honest estimate of how particular hyperparameter choices impact performance.

Finally, the test set is the most sacred. This set is kept entirely separate and unseen throughout the entire development and tuning process. Only after we have finalized our model and chosen all hyperparameters based on the validation set, do we run our model once on the test set. This single evaluation provides the most reliable measure of our model’s performance on truly unseen data – its ability to generalize.

The Golden Rule for all ECE machine learning projects is: The Test Set is used only once, at the very end! Never touch it for tweaking anything during development.
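A three-way split along these lines might be written as below. The helper name `train_val_test_split`, the 70/15/15 proportions, and the placeholder data are assumptions for illustration; the essential points are that the data is shuffled once and that the three index sets never overlap.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into non-overlapping train/val/test subsets."""
    n = len(X)
    # Shuffle indices so that any ordering in the raw data (e.g., sorted
    # by class) does not leak into the split.
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Placeholder data: 1,000 samples with 8 features each.
X = np.arange(8000, dtype=float).reshape(1000, 8)
y = np.arange(1000) % 10

(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = train_val_test_split(X, y)
print(len(X_tr), len(X_val), len(X_te))
```

After this call, `X_te` and `y_te` would be set aside and not looked at again until the final evaluation.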

Tuning Hyperparameters with a Validation Set

An example with `k`

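import numpy as np

# Assume Xtr_rows, Ytr, Xte_rows, Yte are already loaded (e.g., 50,000 CIFAR-10 images).
# Xtr_rows is a 50,000 x 3072 matrix (one flattened image per row).
# Ytr holds the 50,000 corresponding labels.
# Zero-filled stand-ins with those shapes keep this snippet runnable; replace them with real data.
Xtr_rows = np.zeros((50000, 3072), dtype=np.uint8)
Ytr = np.zeros(50000, dtype=np.int64)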

# 1. Split training data into a smaller training set and a validation set
Xval_rows = Xtr_rows[:1000, :] # Take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :] # Use remaining 49,000 for actual training
Ytr = Ytr[1000:]

# 2. Iterate through different k values and evaluate on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]: # Try various 'k' values

    # (Imagine a NearestNeighbor class here with a 'train' and 'predict' method)
    # nn = NearestNeighbor()
    # nn.train(Xtr_rows, Ytr) # Model stores the (now smaller) training data

    # Yval_predict = nn.predict(Xval_rows, k = k) # Predict on validation data for current 'k'
    # acc = np.mean(Yval_predict == Yval) # Calculate accuracy

    # --- Simplified for demonstration using placeholder values ---
    # In a real scenario, the commented lines above would compute the actual accuracy.
    # The numbers below are made-up placeholders, not measured results.
    placeholder_acc = {1: 0.38, 3: 0.42, 5: 0.44, 10: 0.41, 20: 0.35, 50: 0.28, 100: 0.20}
    acc = placeholder_acc[k]
    # --- End simplified part ---

    print(f'k = {k}: accuracy = {acc:.4f}')
    validation_accuracies.append((k, acc))

# 3. Choose the 'k' that gave the best accuracy on the validation set
best_k = max(validation_accuracies, key=lambda item: item[1])[0]
print(f"\nBest k on validation set: {best_k}")

Let’s illustrate this process with a concrete example, just like you’d see in a typical ECE project dealing with sensor data or image classification.

Here, we start with our full training data. Our first step is to carve out a small portion – say, the first 1000 data points – to serve as our validation set. The remaining data then becomes our actual training set for the model.

Next, we loop through different possible values for k, our hyperparameter. For each k, we conceptually “train” our Nearest Neighbor model on the reduced training set and then evaluate its performance solely on the validation set. We calculate the accuracy and record it.

After testing all our desired k values, we look at the validation_accuracies and pick the k that yielded the highest performance on that validation set. This chosen k is now our optimal hyperparameter. Importantly, we still haven’t touched the final test set. In an ECE context, this careful tuning allows us to develop robust systems that perform predictably in the field.
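The example above only imagines a `NearestNeighbor` class with `train` and `predict` methods. One way such a class might look is sketched below, so that the tuning loop computes real validation accuracies. Everything here is an assumption made for illustration: the L1 metric, the requirement that labels be non-negative integers (needed by `np.bincount`), and the synthetic two-class data that stands in for CIFAR-10 to keep the run small.

```python
import numpy as np

class NearestNeighbor:
    """Sketch of a k-NN classifier with the interface assumed in the example."""

    def train(self, X, y):
        # "Training" only stores the data.
        self.Xtr = np.asarray(X, dtype=np.float64)
        self.ytr = np.asarray(y)

    def predict(self, X, k=1):
        X = np.asarray(X, dtype=np.float64)
        y_pred = np.empty(len(X), dtype=self.ytr.dtype)
        for i, x in enumerate(X):
            # L1 distance from this query to every stored training row.
            distances = np.abs(self.Xtr - x).sum(axis=1)
            # Labels of the k closest rows, then a majority vote.
            nearest_labels = self.ytr[np.argsort(distances)[:k]]
            y_pred[i] = np.bincount(nearest_labels).argmax()
        return y_pred

# Synthetic stand-in for image data: two Gaussian blobs in 20 dimensions.
rng = np.random.default_rng(0)
n_per_class, dim = 300, 20
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, dim)),
               rng.normal(1.0, 1.0, (n_per_class, dim))])
y = np.repeat([0, 1], n_per_class)
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Same recipe as above: hold out the first rows for validation.
Xval_rows, Yval = X[:100], y[:100]
Xtr_rows, Ytr = X[100:], y[100:]

nn = NearestNeighbor()
nn.train(Xtr_rows, Ytr)
validation_accuracies = []
for k in [1, 3, 5, 10, 20]:
    acc = np.mean(nn.predict(Xval_rows, k=k) == Yval)
    validation_accuracies.append((k, acc))
    print(f"k = {k}: accuracy = {acc:.4f}")

best_k = max(validation_accuracies, key=lambda item: item[1])[0]
print(f"Best k on validation set: {best_k}")
```

When a vote is tied (possible for even k), `np.bincount(...).argmax()` returns the smallest label, which is an arbitrary tie-breaking rule. The per-query loop is written for clarity, not speed.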

When Validation Data is Small: Cross-Validation

Problem: If your dataset is small, a single validation split might not be representative (noisy estimate).
Solution: Cross-validation provides a more robust estimate of performance for hyperparameter tuning.

How it works (e.g., 5-fold CV):

Divide the training data into N equal “folds” (e.g., 5).
Iterate N times:
- Use N-1 folds for training.
- Use the remaining 1 fold for validation.
- Record performance for the current hyperparameter setting.
Average the performance across all N iterations.

Tip

Common choices: 3, 5, or 10 folds. More folds leave more data for training in each round but require more training runs, so they are more computationally expensive.

What if your dataset is quite small? If you carve out a significant portion for a validation set, your remaining training set might be too small to adequately learn patterns. Conversely, if your validation set is tiny, its performance might be very noisy and not a reliable indicator of the true quality of your hyperparameters.

This is where cross-validation steps in. It’s a more sophisticated and computationally intensive technique, but it gives a much more reliable estimate of hyperparameter effectiveness.

The idea is that instead of relying on just one split, you perform several. For instance, in 5-fold cross-validation, you divide your entire training data into five equal segments. You then run five rounds; in each round, one segment serves as the validation fold and the other four serve as the training folds. For every hyperparameter setting, you train and evaluate on each of these five configurations and average the five results. This averaging reduces the variance in your performance estimate, making it more trustworthy. For ECE applications with limited data, such as specialized sensor readings, cross-validation is invaluable for ensuring robust model selection.
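The fold rotation can be sketched with `np.array_split`. The helper names `knn_accuracy` and `cross_validate`, the small synthetic dataset, and the L2-based k-NN used as the model are assumptions for illustration; labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_val, y_val, k):
    """Accuracy of a simple L2 k-NN classifier (non-negative integer labels)."""
    correct = 0
    for x, label in zip(X_val, y_val):
        distances = ((X_train - x) ** 2).sum(axis=1)
        nearest_labels = y_train[np.argsort(distances)[:k]]
        correct += int(np.bincount(nearest_labels).argmax() == label)
    return correct / len(X_val)

def cross_validate(X, y, k, num_folds=5):
    """Mean validation accuracy of k-NN over num_folds folds."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    scores = []
    for i in range(num_folds):
        # Fold i is the validation fold; the others form the training set.
        X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        scores.append(knn_accuracy(X_train, y_train, X_folds[i], y_folds[i], k))
    return float(np.mean(scores)), scores

# Small synthetic dataset (hypothetical), shuffled before splitting.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(1.5, 1.0, (60, 5))])
y = np.repeat([0, 1], 60)
order = rng.permutation(len(X))
X, y = X[order], y[order]

cv_results = {k: cross_validate(X, y, k)[0] for k in [1, 3, 5, 7]}
best_k = max(cv_results, key=cv_results.get)
print(cv_results, best_k)
```

Each candidate k costs five train-and-evaluate rounds instead of one, which is the extra computation the slide refers to.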

Practical Considerations for Data Splits

Typical Split Ratios:
- Training: 60-80%
- Validation: 10-20%
- Test: 10-20%
When to favor Cross-Validation over a single split:
- Small dataset size (validation set would be too small).
- If a very accurate estimate of hyperparameter performance is crucial.

Note

Cross-validation is computationally more expensive. Choose between a single validation split and cross-validation based on dataset size, available computational resources, and the number of hyperparameters to tune.

In practice, deciding on the exact proportions for your training, validation, and test sets often comes down to the specifics of your project. As a general rule of thumb, you’ll reserve the largest portion, typically 60-80%, for training your model. The validation and test sets usually split the remainder about equally.

The decision to use a single validation split versus more complex cross-validation hinges on a few factors: If your dataset is large, a single validation split is often sufficient and computationally cheaper. However, if your data is limited, or if you have a wide range of hyperparameters that could interact in complex ways, and particularly if you need a very reliable performance estimate before deploying an ECE system, then cross-validation is the safer bet. It minimizes the risk of making hyperparameter choices that only look good on one particular arbitrary split of data.

Pros of Nearest Neighbor Classifiers in ECE

Simplicity & Interpretability:
- Easy to understand and implement.
- Can provide insight into why a decision was made (by looking at neighbors).
- Useful for quick prototyping in ECE.
No Training Time:
- The model merely stores the training data.
- “Lazy learning” – computation happens during inference.
- Can be beneficial for systems where initial training needs to be minimal or dynamic.
Non-parametric:
- Makes no assumptions about the underlying data distribution.
- Can model complex decision boundaries.

Now, let’s consider where Nearest Neighbor classifiers shine, especially from an ECE perspective.

First, its simplicity is a huge plus. It’s easy to grasp, implement, and even explain. In engineering prototyping phases, a simple k-NN can provide a quick baseline without a lot of setup. Its interpretability is also valuable: if you want to know why an anomaly detector classified a sensor reading as ‘faulty’, you can inspect its nearest neighbors for clues.

Second, k-NN is a “lazy learner” – there’s virtually no explicit training phase. All it does during “training” is store the data. All the heavy lifting happens during prediction. This can be an advantage in scenarios where you need to quickly adapt to new data without retraining a complex model, or in resource-constrained embedded systems where intensive training might not be feasible on-device.

Finally, k-NN is non-parametric. It doesn’t assume your data fits a nice linear or Gaussian distribution, which is often a strong and incorrect assumption for real-world ECE data, especially from diverse sensors. It can adapt to very complex data patterns, making it surprisingly powerful in certain contexts.

Cons of Nearest Neighbor Classifiers in ECE

High Test-Time Cost:
- Classifying a new point requires comparing it to all training points.
- Critical for ECE: Unsuitable for real-time applications or embedded systems with large datasets.
Storage Requirements:
- Must store the entire training dataset.
- Problematic for memory-constrained devices (e.g., IoT, edge AI).
Curse of Dimensionality:
- Distances become less meaningful in high-dimensional spaces.
- This is a major issue for image data, as we’ll see next.

Despite its advantages, k-NN has significant drawbacks that often make it impractical for many real-world ECE applications.

The biggest hurdle is its high test-time computational cost. Each time you want to classify a new data point, the system needs to compare it against potentially tens of thousands or even millions of stored training examples. For real-time applications like autonomous driving, satellite imagery analysis, or high-speed communication systems, this latency is unacceptable. You can’t afford to take seconds or minutes to classify a single new input.

This also relates directly to storage requirements. k-NN needs to keep the entire training data readily accessible. For large datasets, this can mean Gigabytes or Terabytes of stored data, which is simply not feasible for memory-constrained embedded systems or many edge AI devices, where flash memory and RAM are precious.
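To put rough numbers on this, the arithmetic below uses the CIFAR-10 figures from the earlier example (50,000 training images with 3072 values each). The one-byte-per-value storage is an assumption corresponding to 8-bit pixels.

```python
# Figures from the CIFAR-10 example: 50,000 training images,
# 3072 values per image. Assumption: one byte per value (8-bit pixels).
num_train = 50_000
dims = 3072

storage_bytes = num_train * dims
# A brute-force query computes one pixel difference per stored value.
differences_per_query = num_train * dims

print(f"storage: {storage_bytes / 1e6:.1f} MB")
print(f"pixel differences per query: {differences_per_query:,}")
```

That is about 153.6 MB of raw pixels and roughly 150 million pixel differences for every single classification, before any sorting of distances.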

Finally, and perhaps most critically for image data, we encounter the curse of dimensionality. As the number of features (like pixels in an image) increases, the concept of “distance” becomes less intuitive and less discriminative. All points tend to become equidistant from each other, making the “nearest neighbor” less meaningful. Let’s see this visually with images.
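The tendency of points to become nearly equidistant can be checked numerically. The sketch below is an illustration under a strong assumption: the points are drawn uniformly at random from the unit hypercube, which real images are not. It reports how much farther the farthest point is than the nearest one, relative to the nearest distance; the name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, dim))  # uniform in the unit hypercube
    query = rng.random(dim)
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))
    return (distances.max() - distances.min()) / distances.min()

contrast = {dim: relative_contrast(dim) for dim in [2, 10, 100, 3072]}
for dim, value in contrast.items():
    print(f"dim = {dim:5d}: relative contrast = {value:.3f}")
```

As the dimension grows toward the 3072 values of a CIFAR-10 image, the contrast shrinks toward zero, so the nearest neighbor is only marginally closer than the farthest one.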

The Curse of Dimensionality & Image Data

Why pixel-based distances fail for images

Pixel-based L1 or L2 distances often correlate more with background and general color distribution than with semantic content.

Warning

The image on the left is the original. The three images next to it are all equally far away by L2 pixel distance. Notice: The L2 distance suggests they are equally similar, despite huge perceptual differences.

A truck and a horse can be “closer” if they share a similar background or lighting.
Semantic meaning (“what the image is”) is lost in raw pixel comparisons.

Here’s a stark visual example of the curse of dimensionality affecting image classification. The image on the left is the original. The three images next to it – which depict a car, a deer, and a bird – are all equally distant from the original image based on a pixel-wise L2 distance calculation. This immediately tells you that raw pixel distances are a poor measure of perceptual or semantic similarity for images.

You can clearly see that, to human eyes, the car and the deer are not equally similar to the original image. This problem is pervasive: a truck might be computationally “closer” to a horse if both are photographed against a similar green field, rather than to another truck on a brown road. The raw pixel values prioritize low-level features like color and texture over high-level semantic content, which is what we actually want to classify. This highlights a fundamental limitation of simple distance metrics for rich, high-dimensional data like images.
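A tiny synthetic case makes the mismatch between pixel distance and content concrete. The 8x8 striped pattern, its one-pixel shift, and the flat gray image below are all made up for illustration; they are not the images shown on the slide.

```python
import numpy as np

def l2(a, b):
    """Pixel-wise L2 distance, computed in floating point to avoid wrap-around."""
    return np.sqrt(((a.astype(np.float64) - b.astype(np.float64)) ** 2).sum())

# Hypothetical 8x8 "image": one-pixel-wide vertical stripes of 0 and 255.
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (8, 4))
# Same content shifted right by one pixel (with wrap-around).
shifted = np.roll(stripes, 1, axis=1)
# Completely different content: a featureless mid-gray image.
gray = np.full((8, 8), 128, dtype=np.uint8)

d_shifted = l2(stripes, shifted)
d_gray = l2(stripes, gray)
print(d_shifted, d_gray)
```

The shifted copy shows exactly the same pattern, yet its L2 distance from the original (2040) is twice that of the blank gray image (about 1020). A nearest-neighbor classifier working on raw pixels would therefore prefer the blank image.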

Visualizing Failure: t-SNE Embedding of CIFAR-10

t-SNE (t-Distributed Stochastic Neighbor Embedding) helps visualize high-dimensional data in 2D or 3D, preserving local neighborhood structures.

Important

Observation: Images nearby in this embedding (meaning they are pixel-wise similar) are clustered by background/color, not by their semantic class (e.g., “dog”, “cat”, “car”).

This image further drives home the point about pixel-based distances. Here, we see a t-SNE visualization of the CIFAR-10 dataset, which contains images from 10 different classes like cars, birds, cats, dogs. t-SNE is a technique that tries to map high-dimensional data into a lower-dimensional space (here, 2D) such that points that were close in the high-dimensional space remain close in the low-dimensional space.

What we observe is revealing: images that are clustered together here are not grouped by their actual semantic content (e.g., all cars together, all birds together). Instead, they are primarily grouped by factors like their background color, lighting conditions, or overall color distribution. For instance, you might see dogs and frogs clustered together if they were all photographed on a white background. This clearly demonstrates that raw pixel values are insufficient for capturing the complex, abstract features needed for robust image classification. This is a critical insight for ECE students: the raw sensor data isn’t always the best representation for ML; feature engineering or learning better representations is key.

Towards Smarter Features: Introduction to Convolution

Why ECE needs better feature extraction for images

The failure of pixel-wise distances indicates a need for more robust feature representations.

ECE Connection: Image processing often involves filtering, a form of feature extraction.

Note

Convolution is a fundamental operation for extracting meaningful features from image and signal data.

How it works:

A kernel (small matrix/filter) slides over the input image.
At each position, it computes element-wise products and sums them.
This generates a new “feature map” highlighting specific patterns (edges, textures).

[Interactive demo: nine sliders (range -1 to 1, step 0.1) set the entries k[0,0] through k[2,2] of a 3x3 kernel. Default values as written in the source: 0, -1, 0 / -1, 4, -1 / 0, -1, 0. The convolved output image updates as the kernel changes.]

The limitations of pixel-wise distances for images highlight a major challenge, and it points us towards the need for more intelligent feature extraction. This is a concept ECE students are deeply familiar with through digital signal processing and filtering. Instead of just comparing raw pixel values, what if we could extract features that do capture semantic meaning or structural properties like edges, corners, or textures?

This is where convolution comes in. It’s a fundamental operation that applies a small filter, or kernel, across an input image. At each step, the kernel performs element-wise multiplication with the underlying image patch and sums the results, producing a single output value for that region. By sliding this kernel across the entire image, we generate a new “feature map” where each value represents the presence (or absence) of the pattern the kernel is designed to detect.

On the right, you can interact with a simple convolution. We have a basic input image. You can adjust the 3x3 kernel values using the sliders. As you change the kernel, observe how the “Convolved Output” image changes. Try to create kernels that detect edges or simply blur the image. For instance, a common edge-detection kernel such as the Laplacian (4 at the center, -1 at the four direct neighbors, 0 at the corners) responds strongly where intensity changes and gives zero on flat regions; a sharpening filter is closely related, since it adds that edge response back to the original image. This interactive demo is a microcosm of how modern deep learning models learn to extract complex features from raw data.
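The slide-multiply-sum operation can be written directly in NumPy. This is a minimal sketch under stated assumptions: the function name `convolve2d_valid` is hypothetical, no padding is used (so the output is smaller than the input), the kernel is a Laplacian-style one with the same values that appear as slider defaults in this section, and the test image is a made-up step edge.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding) and sum element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = (patch * kernel).sum()
    return feature_map

# Laplacian-style edge-detection kernel.
laplacian = np.array([[0, -1, 0],
                      [-1, 4, -1],
                      [0, -1, 0]], dtype=float)

# Hypothetical 6x6 test image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

feature_map = convolve2d_valid(image, laplacian)
print(feature_map)
```

Every row of the 4x4 output is `[0, -1, 1, 0]`: zero where the image is flat and non-zero only at the vertical edge. Strictly speaking, sliding the kernel without flipping it is cross-correlation; for a symmetric kernel like this one the result is identical to convolution.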

Summary

Image Classification: Assigning labels to images.
Nearest Neighbor (k-NN): Simple, non-parametric classifier that uses proximity in feature space. Key hyperparameters: k and distance metric.
Hyperparameter Tuning: Critical for generalizing to new data. Avoid test set overfitting.
Data Splits: Use Training, Validation, and Test sets. Test set is for final evaluation only.
Cross-Validation: Robust tuning for smaller datasets.
k-NN Limitations for Images: High test-time cost, storage, and curse of dimensionality (pixel-wise distances are inadequate).
Future Direction: Need for better feature extraction (e.g., Convolution).

To quickly recap what we’ve covered:

We started with image classification as a core machine learning problem with direct applications in ECE.
We then explored the Nearest Neighbor (k-NN) classifier. It’s simple, non-parametric, and works by finding the closest training examples. We identified its key hyperparameters: k and the distance metric.
A major focus was on the importance of hyperparameter tuning. We learned that improperly tuning these parameters can lead to test set overfitting, which gives a misleadingly optimistic view of our model’s performance.
The solution lies in proper data splitting into training, validation, and test sets. Remember the golden rule: the test set is reserved for a single, final evaluation.
For smaller datasets, we discussed cross-validation as a more robust method for hyperparameter tuning.
Finally, we critically examined the limitations of k-NN, especially for high-dimensional data like images. Its high test-time cost, storage needs, and the significant impact of the curse of dimensionality make simple pixel-wise comparisons inadequate.
This limitation highlighted the need for more sophisticated feature extraction techniques, paving our way to future topics like convolutional neural networks, which are very relevant in ECE for processing signals and images effectively.

Applying k-NN in Practice (ECE Guidelines)

If you consider k-NN (perhaps not for images, but for other sensor data):

Preprocess Data: Normalize features (e.g., zero mean, unit variance). Critical for distance-based methods.
Dimensionality Reduction: For very high-dimensional ECE data (e.g., spectral analysis, multi-sensor arrays), consider PCA, NCA, or Random Projections.
Data Splitting: Robustly split data into train/validation/test. Use cross-validation if data is scarce or there are many hyperparameters to tune.
Hyperparameter Search: Systematically evaluate k and distance metrics on the validation set.
Accelerate Retrieval: For speed-critical ECE applications, explore Approximate Nearest Neighbor (ANN) libraries like FLANN.
Final Evaluation: After selecting the best hyperparameters, evaluate the model once on the untouched test set. Report this performance.

If you ever find yourself needing to apply k-NN in a practical ECE scenario, perhaps not for complex images but for other types of sensor data or feature vectors, here are some guidelines:

Data Preprocessing: Always normalize your features. This means ensuring they have zero mean and unit variance. This prevents features with larger numerical ranges from disproportionately influencing the distance calculations. In ECE, scaling sensor readings is a common practice.
Dimensionality Reduction: If your ECE data is inherently high-dimensional – for example, signals from many different channels or spectral data – consider techniques like Principal Component Analysis (PCA) or Neighborhood Components Analysis (NCA). These can project your data into a lower-dimensional space, mitigating the curse of dimensionality and making k-NN more effective. This is akin to feature extraction in signal processing.
Robust Data Splitting: As discussed, meticulously split your data. Use cross-validation when dealing with limited datasets or when you need highly reliable hyperparameter selection, which is often the case for safety-critical ECE systems.
Systematic Hyperparameter Search: Don’t just guess. Systematically evaluate candidate values of k and candidate distance metrics on the validation set, and keep the best-performing combination.
Accelerate Retrieval: If your ECE application demands real-time responses despite the inherent slowness of k-NN at test time, look into libraries for Approximate Nearest Neighbor (ANN) search, like FLANN. These algorithms provide a trade-off between exactness and speed.
Final Evaluation: And always remember: one final, honest evaluation on your completely untouched test set to confirm your model’s real-world performance.
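The first two guidelines, normalization and dimensionality reduction, might be sketched as follows. The three-feature "sensor" data with very different scales is made up for illustration, and PCA is computed here via a plain SVD rather than a library routine. The point to note is that the mean, standard deviation, and principal directions are all estimated from the training set only and then reused on the validation data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor data: 3 features on very different scales.
scales = np.array([1.0, 100.0, 0.01])
offsets = np.array([5.0, -200.0, 0.0])
X_train = rng.normal(size=(200, 3)) * scales + offsets
X_val = rng.normal(size=(50, 3)) * scales + offsets

# 1. Standardize using statistics from the training set only, then apply
#    the same transform to validation (and later test) data.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_val_std = (X_val - mean) / std

# 2. PCA via SVD of the standardized (hence centered) training data:
#    keep the n_components directions of largest variance.
n_components = 2
_, _, Vt = np.linalg.svd(X_train_std, full_matrices=False)
components = Vt[:n_components]  # shape (n_components, n_features)
X_train_pca = X_train_std @ components.T
X_val_pca = X_val_std @ components.T

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
print(X_train_pca.shape, X_val_pca.shape)
```

Without step 1, the second feature (scale 100) would dominate every L1 or L2 distance. A feature with zero variance would cause a division by zero here, so real code would need to guard against that case.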
