Understanding the concept of overfitting using higher-order linear regression
- Ajay Reddy
- Nov 12, 2021
- 3 min read
Updated: Nov 16, 2021
In this blog we will learn about data generation and sampling, formulating a higher-order regression problem, curve fitting, what overfitting is, and how to avoid it.
Overfitting:
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Underfitting:
A statistical model is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our predictive model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model.
Building the model:
First we need to generate 20 data pairs (x, y), where the values of x are drawn from a uniform distribution between 0 and 1, and the values of y are obtained from the function
y = sin(2*pi*x) + 0.1 * N, where N is noise drawn from a Gaussian distribution.
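A minimal sketch of this step in NumPy is shown below (the random seed and the even train/test split are assumptions, not values from the original notebook):

import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility

# 20 pairs: x drawn uniformly from [0, 1],
# y = sin(2*pi*x) plus Gaussian noise scaled by 0.1
n_points = 20
x = rng.uniform(0.0, 1.0, n_points)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(0.0, 1.0, n_points)

# assumed split: half the pairs for training, half for testing
x_train, x_test = x[:10], x[10:]
y_train, y_test = y[:10], y[10:]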
The true function of the data is plotted below.

Training the model:
Next we train our model on the training data and generate the model weights using gradient descent. The code for a polynomial regression model of order 3 is shown below.
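One way this could look, assuming batch gradient descent on the mean squared error over polynomial features (the learning rate and iteration count are illustrative choices, not the author's values):

def design_matrix(x, order):
    # polynomial features: columns x**0, x**1, ..., x**order
    return np.vander(x, order + 1, increasing=True)

def fit_gd(x, y, order, lr=0.1, epochs=50_000):
    # batch gradient descent on the mean squared error
    X = design_matrix(x, order)
    w = np.zeros(order + 1)
    n = len(y)
    for _ in range(epochs):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
        w -= lr * grad
    return w

w3 = fit_gd(x_train, y_train, order=3)  # weights of the order-3 model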

Fitting the polynomial curve:
After training our model we fit the polynomial curve to the training data. Below are the four graphs for polynomial regression of orders 0, 1, 3, and 9.
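The four fits could be produced with a sketch like the following, reusing the helpers above (the Matplotlib styling is an assumption):

import matplotlib.pyplot as plt

xs = np.linspace(0, 1, 200)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, order in zip(axes.ravel(), [0, 1, 3, 9]):
    w = fit_gd(x_train, y_train, order)
    ax.scatter(x_train, y_train, label="training data")
    ax.plot(xs, np.sin(2 * np.pi * xs), label="true function")
    ax.plot(xs, design_matrix(xs, order) @ w, label=f"order {order} fit")
    ax.set_title(f"order = {order}")
    ax.legend()
plt.tight_layout()
plt.show()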

After training our model, let us test it and compare train vs. test performance with respect to the degree of the model. From the graph below we can see that the root mean squared error on the training set keeps declining as we increase the order of the regression, while the test error declines at first but starts increasing beyond a certain order. This happens because the higher-order models overfit the training data, so the error on unseen data grows.
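A sketch of this comparison, again reusing the helpers above:

def rmse(x, y, w, order):
    # root mean squared error of an order-`order` fit on (x, y)
    pred = design_matrix(x, order) @ w
    return np.sqrt(np.mean((pred - y) ** 2))

for order in range(10):
    w = fit_gd(x_train, y_train, order)
    print(order, rmse(x_train, y_train, w, order), rmse(x_test, y_test, w, order))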

Now let us increase the data to 100 points and test the performance of the model. Here we fit the data with a 9th-order model. As we can see, the curve fits the training data, including its noise, almost perfectly, which results in overfitting. We can avoid this with regularization.
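Regenerating the data with 100 points and refitting the 9th-order model could look like this:

# 100 points drawn the same way as the original 20
x100 = rng.uniform(0.0, 1.0, 100)
y100 = np.sin(2 * np.pi * x100) + 0.1 * rng.normal(0.0, 1.0, 100)
w9 = fit_gd(x100, y100, order=9)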

Regularization:
Regularization is a technique that adds information to a model to prevent overfitting. It is a form of regression that shrinks the coefficient estimates toward zero, reducing the capacity (size) of the model. In this context, reducing the capacity of a model amounts to suppressing extra weights.
In this model we use ridge regression. Ridge regression keeps the same loss function but adds a penalty term, as shown below:
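In plain notation, matching the mean-squared-error loss used in the sketches above, the penalized objective is:

E(w) = (1/n) * Σ (y_pred(x_i, w) − y_i)² + λ * ||w||²

where the second term penalizes large weights and λ controls how strongly.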


Here λ is the regularization hyperparameter. To find the best model, we compute the train and test errors for various values of λ. In our case the model performed worse when λ was 0 or 1; the best result was observed when λ = 1/10000. The root mean squared error for the different values of λ is shown in the graph.
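A sketch of the ridge fit by gradient descent and the sweep over λ, reusing rmse from above (the specific λ grid is illustrative):

def fit_ridge_gd(x, y, order, lam, lr=0.1, epochs=50_000):
    # gradient descent on MSE + lam * ||w||^2
    X = design_matrix(x, order)
    w = np.zeros(order + 1)
    n = len(y)
    for _ in range(epochs):
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
        w -= lr * grad
    return w

for lam in [0.0, 1e-4, 1e-2, 1.0]:
    w = fit_ridge_gd(x_train, y_train, order=9, lam=lam)
    print(lam, rmse(x_train, y_train, w, 9), rmse(x_test, y_test, w, 9))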
Contributions:
Generated weights using gradient descent for all the orders.
Built various models of different orders and plotted graphs on their performance.
Challenges faced:
The main challenge faced while building this model was generating the weights without using Scikit-learn. Here I generated the weights using gradient descent.
Finding the best model across different values of λ was another challenge. I chose the best model by comparing the error values of the different models.
References:
YouTube video: https://www.youtube.com/watch?v=4JQR9k2OXcw
Link to download the ipynb file