Machine Learning

04 Classification: Introduction to Classification

Imron Rosyadi

00. Introduction to Classification

Classification vs. Regression: A Quick Review

Regression: Predicting Continuous Values

Predicts a numeric, continuous output.
Examples: House prices, temperature, signal strength.
Evaluation: Measures like Mean Squared Error (MSE).

Classification: Predicting Categories

Predicts a categorical, discrete output.
Examples: Spam/Not Spam, Object presence (cat/dog), Fault detection.
Evaluation: Focuses on correct vs. incorrect assignments.

What Does It Mean to Classify?

Classification model results are often returned as a list of confidences, one per class. In many models each confidence is a predicted probability that the data point belongs to that class, although raw scores such as the ones below need not sum to 1.

Understanding Classification Confidence

Example Output:

Tiger: 0.96
Lion: 0.75
Cougar: 0.68

Tip

In ECE, such confidence levels are critical in systems like autonomous vehicles (identifying pedestrians with high certainty), medical image diagnosis, or anomaly detection in power grids.

Ambiguous Cases

Orange: 0.97
Grapefruit: 0.96
Sun: 0.45
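One way to make the ambiguity above concrete is to compare the two highest confidences. The following is a minimal sketch, not a standard API: the helper name `top_prediction` and the 0.05 margin are assumptions chosen only for illustration.

```python
def top_prediction(confidences, margin=0.05):
    """Return (best_label, is_ambiguous) for a dict of label -> confidence.

    `margin` is an assumed, illustrative cutoff: if the top two scores are
    closer than this, the prediction is flagged as ambiguous.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, (best_score - runner_up_score) < margin


# The two example outputs from the slides.
clear = {"Tiger": 0.96, "Lion": 0.75, "Cougar": 0.68}
ambiguous = {"Orange": 0.97, "Grapefruit": 0.96, "Sun": 0.45}

clear_result = top_prediction(clear)
ambiguous_result = top_prediction(ambiguous)
print(clear_result)      # ('Tiger', False)
print(ambiguous_result)  # ('Orange', True)
```

A real system would pick the margin from the cost of acting on a wrong label rather than from a fixed constant.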

Common Classification Models

Logistic Regression:
- A variation of linear regression that applies a sigmoid function to a linear combination of features to model binary outcomes. Simple and interpretable.
Nearest Neighbors:
- Classifies a data point based on the majority class among its closest training points. Intuitive, but sensitive to local data structure.
Decision Trees:
- Tree-like structure where each node tests a feature, leading to a class decision. Good for interpretability.
Random Forests:
- An ensemble of many decision trees, combining their predictions for robustness and better accuracy. Often very powerful.
Naive Bayes:
- Based on Bayes’ theorem; assumes features are conditionally independent given the class. Useful for text classification and spam detection.
Deep Learning (Neural Networks):
- Multi-layered networks capable of learning complex patterns. Highly effective for image, speech, and sensor data classification.

The Machine Learning Classification Workflow

graph TD
    A["Raw Sensor/System Data"] --> B{"Data Preprocessing"}
    B --> C["Feature Engineering/Extraction"]
    C --> D{"Split Data <br> (Training & Testing Sets)"}
    D -- Training Data --> E[Choose & Train ML Model]
    E --> F[Model Evaluation & Tuning]
    D -- Testing Data --> F
    F -- Performance OK? --> G[Deploy Model to ECE System]
    F -- Needs Improvement --> E
    G --> H[New Live Data Input]
    H --> I["Real-time Prediction / Classification"]
    I --> J["Action/Decision <br> (e.g., Control Signal, Alert)"]

    style A fill:#f9f,stroke:#333,stroke-width:2px;
    style G fill:#bbf,stroke:#333,stroke-width:2px;
    style J fill:#fcf,stroke:#333,stroke-width:2px;
    style E fill:#ccf,stroke:#333,stroke-width:2px;
    style F fill:#dfd,stroke:#333,stroke-width:2px;

Classification Model Performance

Unlike regression, we can’t measure continuous “distance” to evaluate classification. Instead, we count correct vs. incorrect predictions. These counts form the basis for various performance metrics.

The Confusion Matrix

True Positive (TP): Model predicted positive, was actually positive.
False Positive (FP): Model predicted positive, was actually negative (Type I error).
False Negative (FN): Model predicted negative, was actually positive (Type II error).
True Negative (TN): Model predicted negative, was actually negative.
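The four cells can be counted directly from paired labels. Here is a minimal sketch in plain Python; the function name `confusion_counts` and the sample labels are hypothetical, chosen only to illustrate the definitions above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted positive, actually positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative (Type I)
        elif actual == positive:
            fn += 1   # predicted negative, actually positive (Type II)
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 2, 'TN': 3}
```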

Accuracy

The fraction of all predictions that a classification model got right.
Simply the sum of True Positives and True Negatives, divided by the total.

\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]

Motivation for Precision

When the model predicted positive, how often was it correct?

What is the probability that an anomaly detected in our sensor data is an actual anomaly, given that our model flagged it?

Precision

The fraction of correct positive predictions out of all positive predictions made by the model.

\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]

Motivation for Recall

Out of all the actual positive cases, how many did the model correctly identify?

What is the probability that our model will detect a ‘critical’ electromagnetic interference event, given that it actually occurred?

Recall

The fraction of correct positive predictions out of all actual positive cases.

\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]

Precision vs. Recall: A Trade-Off

Increasing one often decreases the other.
The optimal balance depends on the application’s cost of FP vs. FN.
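The trade-off can be seen by sweeping the decision threshold over a fixed set of model scores. This is a small sketch on made-up data: the scores, labels, and thresholds are hypothetical, and returning a precision of 1.0 when nothing is flagged is an assumed convention.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
    # Assumed convention: no positive predictions means precision 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

thresholds = [0.25, 0.50, 0.75]
results = {t: precision_recall_at(scores, labels, t) for t in thresholds}
for t in thresholds:
    precision, recall = results[t]
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# threshold=0.25  precision=0.67  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```

On this data, raising the threshold removes false positives (precision rises) at the price of missing true positives (recall falls).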

F1 Score

The harmonic mean of precision and recall.
High F1 indicates both precision and recall are high.

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

F1 Score: Simplified

The F1 formula can be reduced to:

\[ F_1 = \frac{2 \cdot \text{TP}}{2 \cdot \text{TP} + \text{FP} + \text{FN}} \]
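The reduction follows from substituting the definitions of precision and recall into the harmonic-mean form. A short derivation, assuming TP > 0 so that no denominator vanishes:

\[ F_1 = 2 \cdot \frac{\frac{\text{TP}}{\text{TP} + \text{FP}} \cdot \frac{\text{TP}}{\text{TP} + \text{FN}}}{\frac{\text{TP}}{\text{TP} + \text{FP}} + \frac{\text{TP}}{\text{TP} + \text{FN}}} = \frac{2 \cdot \text{TP}^2}{\text{TP} \cdot (\text{TP} + \text{FN}) + \text{TP} \cdot (\text{TP} + \text{FP})} = \frac{2 \cdot \text{TP}}{2 \cdot \text{TP} + \text{FP} + \text{FN}} \]

The middle step multiplies the numerator and denominator by (TP + FP)(TP + FN); the last step cancels one factor of TP.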

Interactive Metric Calculator

Adjust the True Positives, False Positives, False Negatives, and True Negatives to see how Accuracy, Precision, Recall, and F1 Score change.

viewof tp_val = Inputs.range([0, 100], {value: 1, step: 1, label: "True Positives (TP)"});
viewof fp_val = Inputs.range([0, 100], {value: 1, step: 1, label: "False Positives (FP)"});
viewof fn_val = Inputs.range([0, 100], {value: 8, step: 1, label: "False Negatives (FN)"});
viewof tn_val = Inputs.range([0, 100], {value: 90, step: 1, label: "True Negatives (TN)"});

Which Metric Do I Use?

Note

The answer is always: it depends on your specific ECE application!

Accuracy: Rarely a sufficient standalone metric, especially with imbalanced classes.
Precision: Crucial when False Positives are costly (e.g., discarding good products in QA, false alarms in security).
Recall: Critical when False Negatives are costly (e.g., missing a fault in critical infrastructure, failing to detect a disease).
F1 Score: A good general measure when you need to balance both precision and recall, particularly with imbalanced datasets.

Confusion Matrix Example

Scenario: A model predicts if a tumor is malignant.

(Positive Class: Malignant, Negative Class: Benign)

Confusion Matrix Example: Data

Model to predict if a tumor is malignant

Given these values:

TP = 1
FP = 1
FN = 8
TN = 90

Solution: Accuracy

\[\text{Accuracy} = \frac{1 + 90}{1 + 1 + 8 + 90} = \frac{91}{100} = 0.91\]

Solution: Precision

\[\text{Precision} = \frac{1}{1 + 1} = \frac{1}{2} = 0.50\]

Solution: Recall

\[\text{Recall} = \frac{1}{1 + 8} = \frac{1}{9} \approx 0.11\]

Solution: F1 Score

\[ F_1 = \frac{2 \cdot 0.50 \cdot 0.11}{0.50 + 0.11} = \frac{0.11}{0.61} \approx 0.18 \]

Or, using the simplified formula:

\[ F_1 = \frac{2 \cdot 1}{2 \cdot 1 + 1 + 8} = \frac{2}{11} \approx 0.18 \]
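The same arithmetic can be checked in a few lines of Python. This is a sketch rather than a library function: the name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    f1_denominator = 2 * tp + fp + fn
    return {
        # Assumed convention: a zero denominator yields 0.0.
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Counts from the tumor example.
tumor = classification_metrics(tp=1, fp=1, fn=8, tn=90)
for name, value in tumor.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.91
# precision: 0.50
# recall: 0.11
# f1: 0.18
```

The 0.91 accuracy looks strong, yet the model finds only 1 of the 9 malignant tumors, which is why accuracy alone is misleading on imbalanced classes.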

Weather Prediction

Scenario: Predict “Rain” or “No Rain”.

Create a Confusion Matrix from this data.

Your Turn: Calculate Metrics

Now that you have constructed the confusion matrix for the weather prediction:

Accuracy = ?
Precision = ?
Recall = ?
F1 Score = ?

Solution: Weather Prediction

Confusion Matrix:

TP (Actual Rain, Predicted Rain): 2
FP (Actual No Rain, Predicted Rain): 2
FN (Actual Rain, Predicted No Rain): 2
TN (Actual No Rain, Predicted No Rain): 1

Metrics:

Accuracy: (2+1) / (2+2+2+1) = 3/7 ≈ 0.43
Precision: 2 / (2+2) = 2/4 = 0.50
Recall: 2 / (2+2) = 2/4 = 0.50
F1 Score: (2·2) / (2·2 + 2 + 2) = 4/8 = 0.50

Graphical Measurements for Classification

Beyond single scalar metrics, graphical tools offer deeper insights into model performance across different decision thresholds.

Precision vs. Recall Curve

Plots Precision against Recall for different threshold values.
Helps select an optimal operating point based on FP/FN costs.

Receiver Operating Characteristic (ROC) Curve

Plots True Positive Rate (Recall) against False Positive Rate.
Helps compare models across all possible thresholds.

ROC Curve: True Positive Rate (TPR) / Recall

\[ \text{TPR (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]

ROC Curve: False Positive Rate (FPR)

\[ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \]

FPR is 1 minus the True Negative Rate (TNR, Specificity).
Measures how many actual negative examples were falsely predicted as positive.

Interpreting the ROC Curve

TPR (Y-axis): Proportion of actual positives correctly identified.
FPR (X-axis): Proportion of actual negatives incorrectly identified as positive.
Dotted Line: Represents a random classifier (AUC = 0.5).
Area Under Curve (AUC): Single scalar metric to summarize the curve.
- AUC near 1.0 indicates excellent discriminative power.
- AUC near 0.5 suggests poor or random classification.
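To show where the curve and its area come from, here is a minimal sketch that builds ROC points by lowering the threshold one example at a time and integrates them with the trapezoid rule. The data is hypothetical, the helper names are assumptions, and the sketch assumes no tied scores and at least one example of each class.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for _, label in ranked:
        if label == 1:
            tp += 1  # one more actual positive is now flagged
        else:
            fp += 1  # one more actual negative is now flagged
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)
print(points[-1])  # (1.0, 1.0)
print(auc)         # 0.875
```

The AUC of 0.875 matches the fraction of (positive, negative) pairs in which the positive example receives the higher score: 14 of 16.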

01. Binary Classification

Binary Classification: Two Outcomes

Yes or No

Predicts one of two discrete values or states.
Commonly encoded as 0 or 1.
Examples:
- Spam / Not Spam
- Fault / No Fault
- Signal Present / Signal Absent
- Pass / Fail for product testing

Binary Classification: Common Models

Logistic Regression: Transforms the output of a linear model into a probability between 0 and 1.
Decision Trees & Random Forests: Can naturally split data into two categories.
Support Vector Machines (SVM): Finds an optimal hyperplane to separate classes with the largest margin.
Bayesian Networks: Probabilistic graphical models used for classification.
Neural Networks: Highly versatile, learn complex non-linear boundaries.

Binary Classification: Logistic Regression Example

Classification

Finds a logistic function to separate two classes.
Outputs a probability value (0-1) which is then thresholded for classification.
Relatively easy to interpret.
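The probability-then-threshold step can be sketched in plain Python. The weights and bias below are hand-picked, hypothetical values rather than a trained model, and the function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Threshold the probability to get a 0/1 class label."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical parameters (not learned from data).
weights = [0.8, -0.5]
bias = -0.2

p = predict_proba([2.0, 1.0], weights, bias)  # z = 1.6 - 0.5 - 0.2 = 0.9
print(round(p, 3))                                # 0.711
print(predict_class([2.0, 1.0], weights, bias))  # 1
print(predict_class([0.0, 1.0], weights, bias))  # 0
```

Interpretability comes from the weights: a positive weight pushes the probability toward class 1 as its feature grows, a negative weight pushes it toward class 0.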

Lab Preview: Fruit Classification

Fruit Classification

Objective: Differentiate between oranges and grapefruit.
Dataset: Contains features like weight, size, and color.
Model: We will build a logistic regression model.

Lab Preview: Hyperparameter Tuning with Grid Search

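from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumes the lab's model is scikit-learn's LogisticRegression, which has no
# 'learning_rate' parameter; each grid value must be a list of candidates.
model = LogisticRegression()
search = GridSearchCV(model, {
  'max_iter': [10000, 15000],
  'C': [1],
})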

Grid Search: Systematically evaluates every combination of the listed hyperparameter values.
Helps find the best parameter settings for your model.
Can be computationally intensive, especially with many parameters.
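To see why the cost grows quickly, it helps to write out the enumeration that a grid search performs. The sketch below is a simplified stand-in, not scikit-learn's implementation: `grid_search` and `toy_score` are hypothetical names, and the scoring function is invented so the example runs without training a model (a real grid search would score each combination with cross-validation).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return the best one found."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # in practice: a cross-validated score
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_score(C, max_iter):
    """Invented score that peaks at C=1.0, max_iter=15000 (illustration only)."""
    return -abs(C - 1.0) - abs(max_iter - 15000) / 100000


grid = {"C": [0.1, 1.0, 10.0], "max_iter": [10000, 15000]}
best_params, best_score, n_evaluated = grid_search(grid, toy_score)
print(best_params)   # {'C': 1.0, 'max_iter': 15000}
print(n_evaluated)   # 6
```

The number of evaluations is the product of the list lengths (3 × 2 = 6 here), so each added hyperparameter multiplies the work.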

Lab Preview: Confusion Matrix Generation

Confusion Matrix

You will generate your first confusion matrix.
Visualizing TP, FP, FN, TN for your fruit classifier.

Your Turn: Binary Classification Lab

Let’s apply these concepts and build a binary classifier!

02. Multiclass Classification

Multiclass Classification: Many Outcomes

Classification problems with more than two classes.
Examples:
- Digit recognition (0-9)
- Speech command recognition (e.g., “activate,” “mute,” “volume up”)
- Modulation scheme identification (e.g., BPSK, QPSK, 16-QAM)
- Component type classification

Multiclass Strategies: One-vs-All (OvA) & One-vs-One (OvO)

One-vs-All (OvA):
- Trains k binary classifiers for k classes.
- Each classifier distinguishes one class from all others.
- Final prediction is the class with the highest confidence.
One-vs-One (OvO):
- Trains k * (k-1) / 2 binary classifiers.
- Each classifier distinguishes one class from another specific class.
- Final prediction is derived by a voting scheme among classifiers.
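The classifier counts and the two decision rules can be sketched as follows. The confidences and pairwise winners are made-up values, and the helper names are assumptions for illustration; no classifiers are actually trained here.

```python
from collections import Counter
from itertools import combinations


def n_classifiers(k):
    """Number of binary classifiers each strategy needs for k classes."""
    return {"OvA": k, "OvO": k * (k - 1) // 2}


def ova_predict(confidences):
    """OvA: pick the class whose one-vs-rest classifier is most confident."""
    return max(confidences, key=confidences.get)


def ovo_predict(pairwise_winners):
    """OvO: each pairwise classifier casts one vote; the most votes wins."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


classes = ["BPSK", "QPSK", "16-QAM"]
pairs = list(combinations(classes, 2))  # one OvO classifier per pair

# Hypothetical outputs for a single input signal.
ova_confidences = {"BPSK": 0.20, "QPSK": 0.70, "16-QAM": 0.35}
ovo_winners = {
    ("BPSK", "QPSK"): "QPSK",
    ("BPSK", "16-QAM"): "BPSK",
    ("QPSK", "16-QAM"): "QPSK",
}

print(n_classifiers(3))              # {'OvA': 3, 'OvO': 3}
print(n_classifiers(10))             # {'OvA': 10, 'OvO': 45}
print(ova_predict(ova_confidences))  # QPSK
print(ovo_predict(ovo_winners))      # QPSK
```

For digit recognition (k = 10), OvO needs 45 classifiers against 10 for OvA, though each OvO classifier trains on only the data for its two classes.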

Lab Preview: The Iris Dataset

Classic ML Dataset: Widely used for multiclass classification.
Features: Sepal length, sepal width, petal length, petal width.
Target: Three species of Iris flowers (Setosa, Versicolor, Virginica).

Lab Preview: k-Fold Cross-Validation

Shuffle data.
Split into k groups (folds).
Iterate k times:
- Use one fold as test set.
- Use remaining k-1 folds as training set.
- Train model on training data.
- Evaluate on test data and record performance.
Average performance metrics across k iterations.
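The steps above can be sketched with index lists alone. This is a minimal illustration, not the lab's code: `k_fold_splits` and `cross_validate` are hypothetical helpers, and `train_and_score` stands for whatever routine trains a model on the training indices and returns a score on the test indices.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle the data
    folds = [indices[i::k] for i in range(k)]  # split into k folds
    splits = []
    for i in range(k):                         # iterate k times
        test_idx = folds[i]                    # one fold is the test set
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]          # remaining k-1 folds train
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over the k folds."""
    scores = [train_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


splits = k_fold_splits(n_samples=10, k=5)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # 8 2 on every iteration
```

Every sample lands in the test set exactly once, so each data point contributes to the averaged estimate.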

Lab Preview: Wine Producer Identification

Challenge: Identify wine producers based on chemical properties.
Dataset: Chemical analysis of different wines.
Your Task: Apply your ML skills with minimal guidance.

Your Turn: Multiclass Classification Lab

Time to apply your knowledge to solve multiclass problems!

Classification with TensorFlow

Dataset: UCI Heart Disease

Predicting the presence of heart disease

Note

This will be a binary classification problem: 0 = does not have heart disease, 1 = has heart disease.

Dataset: Features

Feature	Description
`age`	Age in years.
`sex`	Sex (0 = female, 1 = male).
`cp`	Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic).

Dataset: Features (continued)

Feature	Description
`trestbps`	Resting blood pressure in mm Hg.
`chol`	Serum cholesterol in mg/dl.
`fbs`	Is fasting blood sugar > 120 mg/dl (0 = false, 1 = true).
`restecg`	Results of a resting electrocardiograph (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy).

Dataset: Features (continued)

Feature	Description
`thalach`	Max heart rate.
`exang`	Exercise induced angina (0 = no, 1 = yes).
`oldpeak`	ST depression induced by exercise relative to rest.
`slope`	Slope of peak of exercise ST segment (1 = upslope, 2 = flat, 3 = downslope).

Dataset: Features (continued)

Feature	Description
`ca`	Count of major blood vessels colored by fluoroscopy (0, 1, 2, 3, or 4).
`thal`	Presence of a heart condition (0 = unknown, 1 = normal, 2 = fixed defect, 3 = reversible defect).

ca is a count of major blood vessels colored by fluoroscopy. The values are 0, 1, 2, or 3, and are limited by biology, though occasionally 4 appears due to data quirks. thal relates to a heart defect, describing if it exists and its nature. You might notice that the values on the slides for some of these columns differ from the documentation. For instance, the documentation for ‘ca’ states that the values range from 0-3, but there are 4s in the data. And the documentation for ‘thal’ says that the values are 3, 6, and 7, but the actual values in the data are 0, 1, 2, and 3.

The takeaway from this is that you should always read the documentation, but you should also always look at the data and verify that the documentation is accurate. When there are questions, you should do research. If you are in contact with the source of the data, ask for clarification. Though documentation is great and can really help in data science, the dataset itself is the actual ground truth. This is a critical lesson for ECE engineers handling data from new sensors or instruments.
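A quick way to act on this advice is to compare the values that actually occur in a column against the documented set. The sketch below uses the documented ranges quoted above; the observed lists are short hypothetical samples rather than the real dataset, and `undocumented_values` is an illustrative helper name.

```python
def undocumented_values(observed, documented):
    """Return the observed values that the documentation does not list."""
    return sorted(set(observed) - set(documented))


# Documented values, as described in the notes above.
documented = {"ca": {0, 1, 2, 3}, "thal": {3, 6, 7}}

# Hypothetical samples standing in for the real columns.
observed = {
    "ca": [0, 2, 1, 4, 0, 3],
    "thal": [2, 3, 1, 0, 2, 3],
}

report = {col: undocumented_values(observed[col], documented[col])
          for col in documented}
print(report)  # {'ca': [4], 'thal': [0, 1, 2]}
```

Any non-empty entry in the report is a prompt to investigate before training, not something to silently drop or remap.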

The Model: Output Layer Activation

    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)

For binary classification, the final layer typically has 1 neuron.
Uses a sigmoid activation function.
Output range [0.0, 1.0] interpreted as prediction confidence.
Threshold determines final class.

The model in this lab won’t look too different from the TensorFlow Keras models we built for regression analysis. The primary difference is the final layer in the model. We want to create a binary prediction that will let us know if a patient has heart disease or not. If we stick with a ReLU activation function for the output, then there is no bound for the maximum output value, so it would be impossible to understand what the prediction confidence is. Instead, we’ll use an activation function that limits the output value. In this particular lab, we use a sigmoid function, so the output is limited to the range of 0.0 to 1.0. This output is then a measure of confidence that a patient has heart disease (since “has heart disease” is the 1.0 value). We can then decide how much confidence it takes to classify the patient as having heart disease. The choice of threshold is very important for model performance, and remember that we can measure performance at different thresholds with an ROC curve. This is directly applicable to digital circuit design for ML acceleration, where the choice of activation impacts hardware implementation efficiency.

The Model: Loss Function & Optimizer

    model.compile(
        loss='binary_crossentropy',
        optimizer='Adam',
        metrics=['accuracy']
    )

Loss Function: binary_crossentropy is standard for binary classification.
Optimizer: Adam is an adaptive learning rate optimization algorithm.
- Adjusts learning rate for each parameter, often faster convergence.

How we measure loss is also very important. For binary classification problems, we use binary cross-entropy, which quantifies the difference between the true labels and the predicted probabilities and penalizes confident wrong predictions heavily. Although we’ve talked a lot about plain gradient descent for optimization, there are other gradient-based methods as well. Adam is one of them. Adam uses an adaptive learning rate; that is, it effectively uses a different step size for each parameter in the model. This differs from plain stochastic gradient descent, which uses a single learning rate for all parameters. While Adam often converges faster, selecting the right optimizer is a non-trivial part of ECE machine learning, influencing training time and model accuracy on specialized hardware. A lot of research is being done to understand the conditions under which different optimizers perform better.
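To make the loss concrete, here is a minimal plain-Python sketch of binary cross-entropy averaged over a batch. It is an illustration of the formula, not Keras's implementation; the clipping constant `eps` is an assumption added only to avoid taking log(0).

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Same two patients (labels 1 and 0), two hypothetical sets of predictions.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Confident wrong predictions cost roughly twenty times more here than confident correct ones, which is the pressure that pushes the sigmoid output toward well-calibrated values.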

The Model: Early Stopping

    tf.keras.callbacks.EarlyStopping(
        monitor='loss',
        min_delta=1e-3,
        patience=5,
    )

Purpose: Prevents overfitting and reduces training time.
Mechanism: Stops training when a monitored metric (e.g., loss) stops improving significantly.
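- monitor='loss': Watches the training loss. Use monitor='val_loss' (with validation data) to watch the validation loss instead, which is the more direct guard against overfitting.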
- min_delta=1e-3: Minimum change in the monitored quantity to qualify as an improvement.
- patience=5: Number of epochs with no improvement after which training will be stopped.
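The interaction between `min_delta` and `patience` is easier to follow on a concrete loss history. The sketch below mirrors the meaning of the two parameters as described above; it is a simplified model, not Keras's actual implementation, and the loss values are made up.

```python
def early_stopping_epoch(losses, min_delta=1e-3, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last significant improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:  # improved by more than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # never ran out of patience


# Hypothetical history: fast progress, then changes smaller than min_delta.
losses = [0.70, 0.55, 0.48, 0.45, 0.4495, 0.4494, 0.4493, 0.4492, 0.4491, 0.4490]

stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)  # 9
```

Epochs 5 through 9 each improve on the best loss (0.45) by less than 0.001, so they count as five epochs without improvement and training stops at epoch 9.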

Your Turn: TensorFlow Lab

Now, it’s your turn to perform binary classification using TensorFlow Keras and deep neural networks!

04. Classification Project: Predicting Titanic Survivors

The Titanic Shipwreck Challenge

Goal: Achieve a high accuracy score in predicting passenger survival.
Application: A canonical challenge for applying binary classification.

Review: Types of Classification

What types of classification have we learned about?

Review: ML Tools for Classification

What tools have we learned about for classification?

Review: Evaluation Metrics

What metrics have we learned for evaluating classification models?

Review: Other Useful Techniques

What other useful techniques have we learned, and what are they used for?

Classification Project: The Data

Column	Type	Description
`Survived`	number	1 or 0 (target)
`Name`	string	Passenger name
`Pclass`	number	Ticket class
`Sex`	string	male or female
`Age`	number	Passenger age
`SibSp`	number	# of siblings/spouses on board
`Embarked`	string	Port of Embarkation

Classification Project: Kaggle Competition

Titanic: Machine Learning from Disaster

Engage with a global community of ML practitioners.
Upload your results to compare your model’s performance.

Classification Project: Your Turn

Exploratory Data Analysis (EDA):
- Understand the data, identify obvious problems, and perform initial cleaning.
- Consider pros/cons of using ML for this problem.
Model Building & Evaluation:
- Choose your model (scikit-learn or TensorFlow).
- Train and evaluate your model, discussing chosen metrics.
Make Predictions & Upload to Kaggle:
- Generate predictions for the test dataset.
- Submit your predictions to the Kaggle competition.
Iterate on Your Model:
- Tweak hyperparameters, try different models, explore new features.
- Discuss your methodical approach to improvement.
- Research and compare with other solutions for deeper insights.

It is now your turn to perform a classification from end to end. The lab you are about to be given is divided into four primary parts.

First, you’ll acquire and explore the data. Here we expect you to write code and prose about the data. Does the data have obvious problems? Do any model-independent changes need to be made to the data? EDA is the place to reason about and perform these tasks. This is also a good time to think about the pros and cons of using machine learning to solve this problem.

Second, you will build and evaluate your model. You may choose to use scikit-learn or TensorFlow. You may even try multiple approaches and compare your results. Here you should also evaluate your model and discuss your particular evaluation metrics, including why you chose them and what they say.

Third, you will make predictions on the features found in the test.csv file and upload them to Kaggle using the Kaggle API. Your lab should discuss your predictions as well as your Kaggle results.

Last but not least, iterate on your model. Tweak hyperparameters, and see if you can improve your model. Discuss your method for changing specific hyperparameters. Be thoughtful and methodical; don’t just do it at random! Since this is a popular Kaggle dataset and competition, research other users’ solutions. Try looking at solutions that both do and don’t use ML, and discuss their relative merits. Take your time. Experiment. Don’t be afraid to throw away some work along the way. This iterative process is crucial for ECE engineers developing and refining complex systems.