Machine Learning

03 Regression: Fundamentals, Implementation, and Evaluation

Imron Rosyadi

00. Introduction to Regression

Regression

Regression Introduction Image

Has anyone seen a crime show or heard of an investigation where they used a footprint to determine a suspect’s height? It’s a tactic frequently mentioned in connection to forensics, but does it actually work? Let’s try it out!

Exercise (15 minutes): Use your own shoe size and height as the suspect’s, and tell students to keep in mind that US men’s size equals US women’s size - 2. Don’t tell them the suspect’s height, but tell them the suspect’s shoe size and tell them that their task will be to guess the suspect’s height.

Split the students into groups of ~6 each and give each group a sheet of graph paper.

Ask the groups to plot each group member’s shoe size on the x-axis and height on the y-axis. What do they think is the suspect’s height based on the suspect’s shoe size?

Then have the groups share data, so each has a plot of the whole class’s information. Make another guess per group. Does anyone come close?

They should theoretically have better guesses with more data, but shoe size might not actually be well correlated with height, so they might not.

Mathematical Model

Mathematical Model for Linear Regression

Features Go In, Targets Come Out

Features and Targets in Machine Learning

What is the Machine “Learning?”

Weights and Biases in Linear Regression

Multiple Features

Multiple Features in Regression

Machine Learning Process

Infer/Predict/Forecast
Calculate Error/Loss/Cost
Train/Learn (Update parameters)
Iterate/Repeat (until some stopping condition)

Note

This iterative cycle is fundamental to how most machine learning algorithms “learn” from data.

Machine Learning Process Flow

graph TD
    A[Start] --> B{Data Input};
    B --> C[Predict Target];
    C --> D[Compare to Actual Target];
    D --> E[Calculate Error/Loss];
    E --> F["Update Model Parameters <br> (Weights & Bias)"];
    F --> G{Stopping Condition Met?};
    G -- No --> C;
    G -- Yes --> H[Model Converged];

Predict the Selling Price of a House

House Price Data

Predict the Price of a House Using the Machine Learning Process

House Price Scatter Plot

Predict the Price of a House Using the Machine Learning Process

Initial Guess for Line

Predict the Price of a House Using the Machine Learning Process

Line from Initial Guess

Predict the Price of a House Using the Machine Learning Process

Predicted Values Example

Predict the Price of a House Using the Machine Learning Process

Actual vs Forecasted Values

Predict the Price of a House Using the Machine Learning Process

Error Calculation Illustration

Predict the Price of a House Using the Machine Learning Process

Updated Line Illustration

Interactive: Tuning Our Regression Model

Adjust the weight (slope) and bias (intercept) to fit the data and minimize the Mean Squared Error (MSE).

viewof weight_m = Inputs.range([-10, 10], {value: 1, step: 0.1, label: "Weight (m)"});
viewof bias_b = Inputs.range([100, 200], {value: 160, step: 1, label: "Bias (b)"});

Error/Loss/Cost Functions

Common Loss Functions

Now let’s look at a few common loss/cost functions. Remember we use these functions to determine the error that results from a particular set of weights and biases. These are not the only loss functions, but they are very common.

L1 Loss (Least Absolute Deviations, or LAD): L1 loss is the sum of the absolute differences between actual and predicted targets. It is resistant to outliers in the data (i.e. robust). If your data has outliers that can be ignored, then L1 is a good choice. If it is important to pay attention to any and all outliers, the method of least squares is a better choice.
L2 Loss (Least Squares): L2 loss is the sum of the squared differences between actual and predicted targets. Generally, L2 loss is preferred to L1, but it may not perform well when outliers are present in the data. Squaring the difference between the actual target and the predicted target makes a large error (such as an extreme outlier) dominate the loss, so the fit overcompensates for it.
Mean Squared Error (MSE): MSE is the average of the squared differences between predicted targets and actual targets. Due to squaring, predictions which are far away from actual values are penalized heavily in comparison to less deviated predictions (similar to L2). MSE also has nice mathematical properties which make it easier to calculate gradients, which are used to update the model parameters (weights and biases). It’s widely used in ECE for tasks like signal reconstruction error or system identification.
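The three losses above can be sketched in a few lines of NumPy. The prices below are hypothetical values chosen for easy arithmetic, not the data from the housing example slides.

```python
import numpy as np

# Hypothetical actual and predicted prices (in $1000s), for illustration only.
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 320.0, 330.0])

errors = y_true - y_pred          # per-sample errors: -10, 10, -20, 20

l1_loss = np.sum(np.abs(errors))  # L1: sum of absolute errors
l2_loss = np.sum(errors ** 2)     # L2: sum of squared errors
mse = np.mean(errors ** 2)        # MSE: L2 loss divided by the number of samples

print(l1_loss, l2_loss, mse)      # 60.0 1000.0 250.0
```

Note how the two errors of size 20 contribute 40 of the 60 units of L1 loss but 800 of the 1000 units of L2 loss: squaring gives large errors far more influence.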

Housing Example

Housing Example Data Table

Housing Example (L1 Loss)

L1 Loss Calculation

Housing Example (L2 Loss)

L2 Loss Calculation

Housing Example (MSE)

MSE Calculation

Computer vs. Human Jobs

The computer’s job:

Start with an arbitrary guess of parameters.
Tweak these parameters to reduce loss.
The less the loss is changing, the less the value should be tweaked.

The human’s job:

Choose the learning rate, a constant value which scales how far we tweak the value during each iteration.

Important

Learning rate is a hyperparameter - not a parameter in the actual model.

Gradient Descent

Gradient Descent Illustration

How does the model “iteratively” update its parameters? We can think of our goal as an optimization problem, where we’d like to optimize (minimize) a loss function. Machine learning models then use an “optimizer,” an algorithm to perform that optimization.

The most common optimizer is gradient descent, where the model starts by picking random values for each parameter. It then changes each in the direction that reduces loss the most. On each iteration or “step,” the model should get closer to the minimal loss until it “converges,” or reaches a point where the loss isn’t changing much between steps. (Usually this is based on some threshold, like the loss function changing by less than 0.001 between steps.) Since this isn’t a closed-form solution, gradient descent isn’t guaranteed to converge to the absolute lowest loss possible. There are more sophisticated optimizers that can sometimes do better.

You can control gradient descent by choosing the learning rate, which determines how much you tweak each parameter on each step. We call this a hyperparameter: a value you can change to change model performance, but one that isn’t “learned” by the model. This is analogous to tuning PID controller gains in an ECE system to optimize stability and response time.
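A minimal sketch of gradient descent for a single-feature model \(y = mx + b\) with MSE loss follows. The data, the starting guess of zero, the learning rate, and the stopping threshold are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, b = 0.0, 0.0          # arbitrary starting guess for weight and bias
learning_rate = 0.05     # hyperparameter chosen by the human, not learned
tolerance = 1e-9         # stop when the loss changes by less than this
prev_loss = np.inf

for step in range(10_000):
    error = (m * x + b) - y                 # predict and compare to the targets
    loss = np.mean(error ** 2)              # MSE for the current parameters
    if abs(prev_loss - loss) < tolerance:   # converged: loss barely changing
        break
    prev_loss = loss
    grad_m = 2.0 * np.mean(error * x)       # partial derivative of MSE w.r.t. m
    grad_b = 2.0 * np.mean(error)           # partial derivative of MSE w.r.t. b
    m -= learning_rate * grad_m             # step against the gradient
    b -= learning_rate * grad_b

print(step, round(m, 2), round(b, 2))
```

Because the step is the learning rate times the gradient, the updates shrink automatically as the loss flattens out near the minimum. A learning rate that is too large for the data makes the loop diverge instead of converge.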

Linear Algebra Notation for \(y=mx+b\)

Linear Algebra Notation for Single Feature

Linear Algebra Notation for \(y=mx+b\)

Compact Linear Algebra Notation

Multiple Regression (i.e. Multiple Features)

Multiple Features in Multiple Regression

Multiple Regression Notation

Multiple Regression Compact Notation

Closed Form, Exact Solution

\[ \theta = (X^{T} \cdot X)^{-1} \cdot X^{T} \cdot y \]

Good for small datasets
Finds optimal solution
Can be computationally expensive
Requires an invertible matrix
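The closed-form formula can be evaluated directly with NumPy. This is a sketch on hypothetical data; it assumes the first column of \(X\) is a column of ones so that the first entry of \(\theta\) is the bias.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 3x + 4.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 4.0

# Design matrix: a column of ones (for the bias) next to the feature column.
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y

print(theta)  # approximately [4. 3.]: bias first, then weight
```

Explicitly inverting \(X^T X\) mirrors the formula, but in practice a least-squares solver such as `np.linalg.lstsq` is usually preferred because it is more numerically stable and also copes with a non-invertible \(X^T X\).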

Batched Data

Break data into smaller batches.

We’ll use a new batch on each learning step.
A new hyperparameter, batch size, controls how much data is used for each learning step.

Batched Data for Training

Hyperparameters We Care About

Hyperparameter Tuning Guidelines

01. Introduction to scikit-learn

scikit-learn

Scikit-learn.org is the primary website for the scikit-learn project. Here you will find information pertaining to scikit-learn, including instructions on installation, documentation, and even the project source code.

Let’s take a few moments to look around the project website.

Exercise (10 minutes): Either navigate to scikit-learn.org on your own computer and present your computer screen to the students (recommended), or ask them to open their laptops to scikit-learn.org. Take the time to point out the following elements on the website:
* The classification, regression, clustering, dimensionality reduction, model selection, and preprocessing sections on the main page. These represent core groupings of features provided by scikit-learn.
* The top-page navigation with links on how to install the toolkit, documentation, and examples.
* The banner on the upper right corner that says “Fork me on GitHub.” This leads to the source code.
* When you click the ‘Documentation’ drop-down in the upper navigation, it tells you the current stable version and has a link to ‘All available versions.’ Tell students to be sure to check the version of scikit-learn they’re working with once they start the lab.
* The ‘Examples’ linked in the top navigation are not just API usage examples; they also contain some interesting machine learning insights.

Datasets

Loading

Fetching

Generating

Bunches

Estimators

Transformers

Pipelines

Metrics

Your Turn

02. Linear Regression with scikit-learn

Linear Regression

Linear Regression Fit Animation

scikit-learn: Using Linear Algebra

Note

This is not a learning algorithm that iteratively optimizes. It’s a direct, closed-form solution.
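As a small sketch of the closed-form estimator in use, the snippet below fits `LinearRegression` on hypothetical house sizes and prices that lie exactly on a line, so the recovered weight and bias are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (sq. ft.) -> price ($1000s),
# constructed so that price = 0.15 * size + 50 exactly.
X = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])  # 2-D: one row per house
y = np.array([200.0, 275.0, 350.0, 425.0])

model = LinearRegression()
model.fit(X, y)                       # solved directly, no iterations or learning rate

print(model.coef_, model.intercept_)  # weight close to 0.15, bias close to 50
print(model.predict(np.array([[1800.0]])))  # close to 320
```

Note that scikit-learn expects the features as a 2-D array (samples by features), even when there is only one feature.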

Optimizers

(Batch) Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-Batch Gradient Descent

Recall that our overall goal is to learn parameters that minimize a particular cost/loss function. There are many ways to perform this optimization, but gradient descent is a very popular choice. At a high level, we use the gradient (i.e. the derivative/slope) of the cost function to determine the direction to adjust the parameters. In other words, if we want to get to the bottom of the hill, we walk in the direction of the steepest downward slope.

Batch Gradient Descent: The entire dataset is used to calculate the gradient during each iteration of training. This can be computationally expensive for large datasets.
Stochastic Gradient Descent (SGD): We randomly choose one data point from our training set to compute the gradient at each iteration (i.e. we use a batch-size of 1). This introduces more noise but can be much faster per iteration.
Mini-Batch Gradient Descent: A middle ground between batch and stochastic gradient descent, using a fixed number of training samples (greater than 1, but less than the entire dataset) to compute the gradient during each iteration. This combines the speed of SGD with some stability.
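The three variants differ only in how many samples feed each gradient computation, which the following sketch makes explicit. The helper `minibatch_gd`, its default learning rate and epoch count, and the data are hypothetical choices for illustration.

```python
import numpy as np

def minibatch_gd(x, y, batch_size, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = m*x + b with MSE loss, using `batch_size` samples per update."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # indices of the current batch
            error = (m * x[idx] + b) - y[idx]
            m -= learning_rate * 2.0 * np.mean(error * x[idx])
            b -= learning_rate * 2.0 * np.mean(error)
    return m, b

# Hypothetical noise-free data generated from y = 2x + 1.
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * x + 1.0

batch_result = minibatch_gd(x, y, batch_size=len(x))  # batch gradient descent
sgd_result = minibatch_gd(x, y, batch_size=1)         # stochastic gradient descent
mini_result = minibatch_gd(x, y, batch_size=5)        # mini-batch gradient descent
print(batch_result, sgd_result, mini_result)
```

With the same number of epochs, a batch size of 1 performs 20 parameter updates per pass over this data while full-batch performs only one, which is the speed-versus-noise trade-off described above.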

These various methods mirror iterative optimization techniques used in ECE for adaptive systems or circuit design (e.g., tuning filters).

scikit-learn: Stochastic Gradient Descent

scikit-learn: SGD Hyperparameters

There aren’t really any hyperparameters to tune for LinearRegression (the closed-form solution). SGDRegressor, however, has many hyperparameters that can be tuned. You can see some of those hyperparameters in use here:

max_iter: The maximum number of passes over the training data (epochs). Sometimes training more can improve performance.
n_iter_no_change: Manages “early stopping.” If loss doesn’t meaningfully improve for this many consecutive epochs, training stops.
tol: The tolerance for loss improvement. An epoch counts as “no improvement” when the loss improves by less than this amount.
learning_rate: The learning rate schedule, which determines how the step size used to adjust the weights changes during training. ‘adaptive’ means the step size is reduced when training stops improving. The initial step size is set separately with eta0. Different learning rate schedules are crucial in ECE control systems for stability or convergence speed.

There are many more hyperparameters that can be found in the SGDRegressor documentation, allowing for fine-grained control over the optimization process, similar to tuning intricate analog circuits.
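The sketch below sets the hyperparameters listed above on an `SGDRegressor`. The data and the specific values are assumptions for illustration, not recommended settings. The `StandardScaler` step is an addition here: SGD is sensitive to feature scale, so scaling the inputs first is a common precaution.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical noisy data around y = 3x + 5.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0.0, 0.5, size=200)

model = make_pipeline(
    StandardScaler(),              # put the feature on a scale SGD handles well
    SGDRegressor(
        max_iter=1000,             # upper limit on epochs
        tol=1e-3,                  # minimum loss improvement that counts as progress
        n_iter_no_change=5,        # epochs without progress before reacting
        learning_rate="adaptive",  # shrink the step size when progress stalls
        eta0=0.01,                 # initial step size
        random_state=0,            # make the shuffling reproducible
    ),
)
model.fit(X, y)

train_r2 = model.score(X, y)       # R^2 on the training data
print(train_r2)
```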

scikit-learn: SGD `partial_fit`

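from sklearn.linear_model import SGDRegressor

# X_1..X_n and y_1..y_n are successive batches of training features and targets.
sgd_reg = SGDRegressor()
sgd_reg.partial_fit(X_1, y_1)  # each call runs one epoch of SGD over the given batch
sgd_reg.partial_fit(X_2, y_2)
...
sgd_reg.partial_fit(X_n, y_n)
sgd_reg.coef_, sgd_reg.intercept_  # learned weights and bias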
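A runnable version of the same idea is sketched below, with the batches produced by `np.array_split`. The data, the number of batches, and the number of passes are hypothetical choices; in a real setting the batches might arrive from a file or a stream that does not fit in memory.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical noise-free data generated from y = 4x + 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 1))
y = 4.0 * X[:, 0] + 2.0

sgd_reg = SGDRegressor(random_state=0)

for epoch in range(50):                    # repeat passes over the data ourselves
    batches = zip(np.array_split(X, 10), np.array_split(y, 10))
    for X_batch, y_batch in batches:       # 10 batches of 100 samples each
        sgd_reg.partial_fit(X_batch, y_batch)  # update the model with this batch only

print(sgd_reg.coef_, sgd_reg.intercept_)   # close to [4.] and [2.]
```

Unlike `fit`, `partial_fit` does not loop until convergence or apply `max_iter` and `tol`, so the caller decides how many passes to make.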

Loss

Mean Squared Error

\[ MSE = \frac{1}{n} \sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2} \]

Train/Validate, Test

Train, Validate, Test Split

This lab will also be the first time we’ll need to split our data for model training.

When we train a model, we could use all of the data that we have. When we do that, however, we risk overfitting the model to our data, and we lose the ability to test our model on “new” data it hasn’t seen. The model might become really good at making predictions that look like the data that it has already seen, but really bad at generalizing.

For this reason we typically hold out some of the data and don’t use it to train the model at all. We keep this “test set” of data and use it only to evaluate the model after training has completed. We pass the trained models the features in the test set, get the predictions from the model, and then calculate the difference between the predictions and the actual values.
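The hold-out procedure described above can be sketched with `train_test_split`. The data and the 80/20 split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy data around y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 1.0, size=100)

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)   # train on the training set only

# Evaluate on the held-out test set: predict, then compare to the actual values.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(len(X_train), len(X_test), test_mse)
```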

The concept of a validation set is also important. The validation set is a portion of the training data that is set aside during training so the model can be evaluated while it is still being fit. The loss calculated on the validation set directly affects training decisions, such as when to stop early. This practice is akin to rigorous testing of an ECE prototype against a test bench and then against real-world scenarios before deployment.

Train/Validate, Test, Validate

Double Validation Process

The holdout data story gets more complicated when hyperparameter tuning is involved. When you tune hyperparameters, you’ll still have the same training and validating data available during model fitting. Then you’ll use your test data to see how well the model generalizes. With that said, if you then change hyperparameters and test again, you risk over-tuning hyperparameters to the test data set.

In order to prevent this, many data scientists also keep another holdout dataset called the validation dataset (often called a “holdout validation set” or “dev set” to distinguish from the training-time validation set). This dataset is used for one final check after you have selected hyperparameters and before final deployment. This multi-layered validation ensures the model’s robustness, similar to multiple stages of verification and validation in ECE product lifecycle management.
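One way to arrange these layers of held-out data is sketched below. The split sizes, the candidate `eta0` values, and the data are hypothetical. Here `early_stopping=True` makes `SGDRegressor` set aside `validation_fraction` of the training data as the training-time validation set.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

# Hypothetical data: three features with known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=500)

# Set the final holdout aside first, then split the rest into train and test.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

best_score, best_model = -np.inf, None
for eta0 in (0.001, 0.01, 0.1):             # candidate hyperparameter values
    model = SGDRegressor(
        eta0=eta0,
        early_stopping=True,                # training-time validation inside fit
        validation_fraction=0.1,
        random_state=0,
    )
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)     # R^2 on the test set guides the choice
    if score > best_score:
        best_score, best_model = score, model

# One final check on data that played no part in choosing the hyperparameters.
holdout_score = best_model.score(X_holdout, y_holdout)
print(len(X_train), len(X_test), len(X_holdout), holdout_score)
```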

Your Turn

03. Regression Quality

Coefficient of Determination (\(R^2\))

\[ SS_{res} = \sum_{i}(y_i - \hat{y_i})^2 \]
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_{i} \]
\[ SS_{tot} = \sum_{i}(y_{i}-\bar{y})^2 \]
\[ R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} \]

The coefficient of determination, denoted \(R^2\), is one of the most important metrics in regression. It tells us how much of the data is “explained” by the model. This is very useful in ECE to understand how well a model captures the underlying physics or behavior of an electronic component or system.

Before we can define the metric itself, we need to define a few other key terms:
* Residual sum of squares (\(SS_{res}\)): The summation of the square of every difference between the target value \(y_i\) and the predicted value \(\hat{y_i}\).
* Total sum of squares (\(SS_{tot}\)): The sum of the squared differences between each value \(y_i\) and their mean \(\bar{y}\).

Given these, we can calculate \(R^2\). The ratio \(SS_{res}/SS_{tot}\) is the fraction of the variance in the \(y\)-values that the predictions \(\hat{y}\) fail to capture, so \(R^2\) is the fraction of that variance the model does explain.

Typically, this score ranges from 0 to 1, where 0 is bad and 1 is a perfect mapping. However, the score can also be negative. This happens if a simple horizontal line through the mean \(\bar{y}\) performs better than your regression, meaning your model is truly terrible and needs serious revision. For values in the range 0-1, interpreting the \(R^2\) is more subjective. The closer to 0, the worse your model is at fitting the data. And generally, the closer to 1, the better. But you also don’t want to overfit, which we’ll discuss later.
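The definitions above translate directly into code. The sketch below uses hypothetical targets and two hypothetical sets of predictions, one close to the targets and one worse than simply predicting the mean, and cross-checks the result against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y, y_hat):
    """Coefficient of determination computed from SS_res and SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 6.5, 9.5])   # each prediction is off by 0.5
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # trend is reversed

good_r2 = r_squared(y_true, y_good)   # SS_res = 1,  SS_tot = 20 -> about 0.95
bad_r2 = r_squared(y_true, y_bad)     # SS_res = 80, SS_tot = 20 -> -3.0

print(good_r2, bad_r2)
print(r2_score(y_true, y_good), r2_score(y_true, y_bad))  # same values
```

The negative score for the second set of predictions shows the case described above: a horizontal line at \(\bar{y}\) would have \(SS_{res} = SS_{tot}\) and score 0, which is better.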

Mean Squared Error (MSE)

\[ MSE = \frac{1}{n} \sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2} \]

Root Mean Squared Error (RMSE)

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2}} \]

Note

RMSE returns the error in the original units of the target variable for easier interpretation.

Mean Absolute Error (MAE)

\[ MAE = \frac{1}{n} \sum_{i=1}^{n}|y_{i} - \hat{y_{i}}| \]
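The three error metrics can be computed with `sklearn.metrics` as sketched below. The target and predicted values are hypothetical, and RMSE is taken as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values; the errors are -10, 10, -30, 0.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 230.0, 250.0])

mse = mean_squared_error(y_true, y_pred)    # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                         # about 16.58, in the target's units
mae = mean_absolute_error(y_true, y_pred)   # (10 + 10 + 30 + 0) / 4 = 12.5

print(mse, rmse, mae)
```

RMSE comes out larger than MAE here because the single error of 30 is weighted more heavily once it is squared.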

Predicted vs. Actual Plots

Good Predicted vs Actual Plot

Predicted vs. Actual Plots (Positive Bias)

Predicted vs Actual Plot with Positive Bias

Residual Plots

Good Residual Plot

Residual Plots (Bias Example)

Residual Plot with Bias

Your Turn

04. Polynomial Regression & Overfitting

Outline

Recall Linear Regression
Polynomial Regression: What is it and how is it different (or not so different)?
Caution! Dangers of polynomial regression!

Tip

Understanding when and how to use polynomial regression is key to modeling complex system behaviors in ECE.

Linear Regression Fit Animation

Dataset for Polynomial Regression

Linear Fit on Non-linear Data

Polynomial Fit on Non-linear Data

Polynomial Equations

Examples of Polynomial Equations

What is the Original Curve?

Original Polynomial Curve

Polynomial Regression Process

Polynomial Regression = Linear Regression

Turn the original polynomial regression problem into a linear regression problem with multiple features.
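A minimal sketch of this transformation follows: `PolynomialFeatures` expands the single feature \(x\) into the columns \(x\) and \(x^2\), and an ordinary `LinearRegression` then fits one weight per column. The quadratic used to generate the data is a hypothetical example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data generated from y = 0.5x^2 - 2x + 1.
x = np.linspace(-3.0, 3.0, 30).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 - 2.0 * x[:, 0] + 1.0

# Expand the single feature into two features: x and x^2.
# include_bias=False because LinearRegression fits its own intercept.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)   # plain linear regression on 2 features

print(X_poly.shape)                         # (30, 2)
print(model.coef_, model.intercept_)        # close to [-2.  0.5] and 1.0
```

The model is still linear in its parameters; only the features are non-linear functions of the original input.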

Overfitting

Dataset for Overfitting Example

Overfitting Demonstration

Overfitting Analogy: Clothing Fit

Well-fitting shirt

Overfitted clothing

Underfitted clothing

Just right fit

Illustration of Overfitting Regression

How Do We Avoid This?

Avoiding Overfitting

Simpler polynomial
More training data
Dropping out some training data (or features)
Overfitting penalties (regularization)

Important

Regularization is a key technique to manage overfitting by adding penalties to model complexity.

Here are some of the most common ways to avoid overfitting:

Using a polynomial function of lower degree: If your model is introducing enough curvature to cross most training data points, then use a lower-degree polynomial. This reduces model complexity.
More training data: As your dataset grows in size, it will likely also grow in diversity and create a model that is less overfitted because it sees more variations. This can be challenging for ECE systems where data collection is expensive.
Dropping out some training data (or features): Sometimes removing less informative features or artificially simplifying the input can help.
Overfitting penalties (regularization): There are strategies for adding penalties to the model that make even a high-degree polynomial less likely to overfit. Some common strategies are called Lasso, Ridge, and ElasticNet. We’ll look at each of these more closely, and you’ll experiment with them in your lab exercise. Regularization can be seen as introducing a “cost” for excessive model complexity.

Regularization

Recall: Mean Squared Error

Mean Squared Error Formula Breakdown

As a reminder, this is the equation of a common loss function, the mean squared error. Let’s break down its components in the context of our linear algebra notation:

Line 1: (y_true - y_pred) is the error between the true target and the predicted target from the model.
Line 2: Recall that y_pred came from a linear regression equation, which can be written in matrix notation as X * theta.
Line 3: Completing the matrix multiplication and writing the multivariate regression formula using detailed notation:
- n = number of rows in the training data
- p = number of coefficients in the equation (also number of features + bias term)
- y_true = true target value
- theta_0 = intercept (bias)
- theta_j = coefficients (weights) of the polynomial equation
- x_i = feature values

This foundation helps us understand how regularization will modify this loss function.

Lasso (L1) Regularization

Lasso (L1) Regularization Formula

What does it mean to shrink coefficients? It effectively means to increase the value of the loss function as the coefficients get larger.

Lasso is L1-regularized regression. This means that it adds the sum of the absolute values of the coefficients (multiplied by a regularization parameter \(\lambda\)) to the original MSE loss function. Because each term lambda * |theta_j| is non-negative, the cost function is larger than the regular MSE loss whenever any theta_j is non-zero. This regularization term pushes the model toward smaller theta_j values.

Smaller coefficients make the model “more linear” or less sensitive to specific features. For example, in a high-degree polynomial, if theta_2 and theta_3 become very small, their terms have less impact on the curve.

LASSO is an acronym for “Least Absolute Shrinkage and Selection Operator.”

Due to the absolute value function, L1 regularization has a unique property: it can drive some coefficients exactly to zero. This means L1 regularization can perform feature selection, effectively identifying and ignoring less important features. This is highly useful in ECE for reducing model complexity and identifying critical system parameters.
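The feature-selection behavior can be observed with scikit-learn's `Lasso`, as in the sketch below. The data is hypothetical: five features, of which only the first two influence the target. The `alpha` argument is scikit-learn's name for the regularization parameter \(\lambda\), and the value 0.1 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 5 features, but only features 0 and 1 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

print(lasso.coef_)                        # last three entries are exactly zero
print(np.flatnonzero(lasso.coef_))        # indices of the features that were kept
```

The two informative coefficients are also pulled slightly toward zero (below 3 and above -2 in magnitude terms), which is the shrinkage half of the acronym.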

Ridge (L2) Regularization

Ridge (L2) Regularization Formula

Ridge regularization looks similar to Lasso, but instead of appending the sum of absolute values of coefficients to the loss function, it appends the sum of their squares. This is also multiplied by a regularization parameter \(\lambda\).

Ridge regularization also shrinks coefficients towards zero, but it tends to shrink them proportionally. Unlike Lasso, L2 regularization rarely drives coefficients exactly to zero; it prefers to reduce their magnitude without fully eliminating them.

Since the L2 penalty (the sum of squared coefficients) is differentiable everywhere, problems using this method can be solved efficiently by gradient descent, making it computationally attractive. In ECE, Ridge regression helps stabilize models in the presence of multicollinearity (highly correlated features), which is common in sensor data fusion.
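The contrast with Lasso can be sketched by comparing `Ridge` against an unregularized `LinearRegression` on the same hypothetical data. As with Lasso, `alpha` plays the role of \(\lambda\), and the value 10.0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: 5 features, but only features 0 and 1 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=50)

ols = LinearRegression().fit(X, y)     # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)    # penalty on the sum of squared coefficients

ols_norm = np.linalg.norm(ols.coef_)       # overall size of the coefficient vector
ridge_norm = np.linalg.norm(ridge.coef_)

print(ols_norm, ridge_norm)                # Ridge's coefficient vector is smaller
print(np.count_nonzero(ridge.coef_))       # but all 5 coefficients stay non-zero
```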

ElasticNet (L1 + L2)

Which Regularization Is Best?

Caution

The choice of regularization often depends on the specific dataset and problem. Experimentation is key!

It depends on the specific characteristics of your data and your modeling goals:

L1 regularization (Lasso):
- Can drive coefficients to zero, and it tends to produce a sparse model (i.e., many coefficients are zero).
- Its unique ability to eliminate irrelevant features makes it useful for feature selection.
L2 regularization (Ridge):
- Is less likely to drive coefficients exactly to zero; it shrinks them proportionally instead.
- Tends to produce a denser model (most coefficients are small but non-zero).
- Effective at handling multicollinearity (highly correlated features).
ElasticNet:
- As a compromise between L1 and L2, it often works well in situations where there are multiple correlated features or when feature selection is important.

It is probably worth experimenting with each method to see which works best for your particular model, considering performance, interpretability, and computational resources, especially in diverse ECE problem domains.
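For completeness, here is a sketch of `ElasticNet` on hypothetical data with two nearly identical features and one irrelevant feature. In scikit-learn, `alpha` sets the overall penalty strength and `l1_ratio` sets the mix between the L1 and L2 penalties; both values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data: features 0 and 1 are near-copies of the same signal,
# feature 2 is unrelated noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + rng.normal(0.0, 0.05, size=200),
    base + rng.normal(0.0, 0.05, size=200),
    rng.normal(size=200),
])
y = 2.0 * base + rng.normal(0.0, 0.1, size=200)

# l1_ratio=0.5 weights the L1 and L2 penalties equally;
# l1_ratio=1.0 would be pure Lasso.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

enet_r2 = enet.score(X, y)     # R^2 on the training data
print(enet.coef_, enet_r2)
```

Swapping in `Lasso` or `Ridge` on the same data and comparing the printed coefficients is a quick way to carry out the experimentation suggested above.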
