All Courses

Gradient Descent in Neural Networks

Neha Kumawat

4 years ago

Gradient Descent
In my last article, I showed you that, how actually a neural network works. Let me give you a quick recap so that you get familiar if you have forgotten something.
The purpose of this article is to introduce you with the idea of gradient descent, which tells how neural networks learn. Then after that, we are going to see how this particular network performs and what those hidden layers of neurons are actually doing.
So, let’s start.
In the previous article, we have seen the example of handwritten digit recognition.
These digits are provided on a 28 by 28-pixel grid each pixel with some grayscale value between 0 & 1 which determines the activations of all 784 neurons in the input layer of the network, which then determines the activation for each neuron in the next layers based on a weighted sum of all the activations in the previous layer and a special number called a bias, then you apply that sum with some other activations function like the sigmoid or ReLU, etc.
If we arbitrary choose two hidden layers with 20 neurons each, the system has about 16330 weights and biases together that we can adjust. These weights and biases values tell us how our network will perform.
Then what we mean when we say that this network classifies a given digit? Is that the brightest of those 10 neurons in the last layer corresponds to that digit?
The second layer which is a hidden layer could pick up on the edges and the third layer could pick up on patterns like loops and lines.
And the last layer could just piece together those patterns to recognize the given digits. So here we learn how the network learns.
We want to write an algorithm where we can apply training data to this system which is in the form of different images of handwritten digits along with labels for what they are to be recognized and It will adjust those 16,330 weights and biases to improve its performance on the training data.
MNIST database has put together a collection of thousands of handwritten digit images each one is labeled with the numbers that they are supposed to be.
Basically, we are trying to find the minimum of a certain function, with each neuron is connected to all of the other neurons in the previous layer. The weights in the weighted sum defining its activation is like the strength of these connections and the bias is the indication of whether that neuron tends to be active or inactive.
So let’s take a simple example and try to understand what simply happening.
Imagine, one day you and your friends went for trekking. All of you reached on the top of a mountain. As you are tired and want some rest, you told your friends to move forward and get down you will be joining them after taking some rest.  While you trying to get down a mountain with a blindfold on. It’s impossible to know which direction to go in, but there’s one thing you can know: if you will be going down (making progress) or going up (losing progress). Eventually, if you keep taking steps that lead you downwards, you’ll reach the base.
Similarly, it’s impossible for us to know what our model’s weights should be right from start. But with some trial and error based on the loss function (whether you descending), you can end up getting there eventually.
We have to first initialize all of those weights and biases randomly.
Let’s imagine, we feed in this image of a 5, but due to this random weights and biases, the output layer just looks very different and complex.
So, to overcome this problem we define a cost function which tells us that the output should have activations which are 0 for most neurons and 1 for the neuron which is recognizing that digit...
So, mathematically we can tell that we have to add up the squares of the differences between each of error output activations and the value that we want them to have, this is nothing but the cost of a single training.
Like this we have multiple training examples and their cost function output values.
The sum of these values is small when the system confidently recognizes the image correctly but it's large when the system seems like it doesn't know what it's recognizing.
So, in this case we consider the average cost of all training examples available to us, which is the measure of how noisy the system is. The system was a function that takes 784 inputs, the pixel values and spits out ten numbers as its output using all these weights and biases.
The cost function takes as its input those 16,330 weights and biases and it produces a single number as a result.
Now a question arises how to change those weights and biases to get better results?
Let’s consider a function that has one number as an input and one number as an output. Here, we are going to find an input that minimizes the value of this function.
Technique which we are going to use here is to start at any input and find out in which direction you should take your step to make that output lower.
So basically, we find out the slope of the function where you are currently at and then shift to the left if that slope is positive or shift to the right if that slope is negative.
By repeatedly doing this at every point that is checking the new slope and taking the proper step, we will get local minimum of the function.
Because of starting with the random input, we cannot predict where the local minimum should be.
One thing to note down here is, step sizes are proportional to the slope and when the slope is flattening out towards the minimum, steps get smaller and smaller.
Let us consider a function with two inputs and one output, by taking input space in the XY plane and the cost function as a surface above it, here we have to consider in which direction to step-in this input space?
The gradient of a function gives you the direction of steepest ascent.
Now arises a new question, in which direction should we step to increase the function most quickly? taking the negative of that gradient gives you the direction to step that decreases the function most quickly.
It's the basic idea for a function that has 16,330 inputs instead of two inputs and organizing all 16,330 weights and biases of the system into a big column vector.
The negative gradient of the cost function is a vector. It's some direction inside this input space that tells you which nudges to all of those numbers are going to cause the rapid decrease to the cost function.
Changing the weights and biases will make the output of the network on each example of training.
This cost function involves an average overall of training data. This is what about the backpropagation.
The cost function should have a nice smooth output so that we can find a local minimum by taking little steps downhill.
The process of continuously putting an input of a function by some multiple of the negative gradient is called gradient descent.
So, let me summarize what actually gradient descent does?
It’s a way to converge towards a global minimum of a cost function.
  • It tries to calculate what a small change in each individual weight would do to the loss function (i.e. which direction should we move)
  • Then it adjusts each individual weight-based on its gradient (i.e. take a small step in the determined direction)
  • It keeps iterating step 1 and step 2 until the loss function gets as low as possible and reaches global minima to get the best performance model.
Local and Global minima
Each component of the negative gradient gives an idea about two things first the sign, of course, tells us whether the corresponding Component of the input, the vector should be pushed up or down, but relative magnitudes of all these components tell you which changes matter more.
An adjustment to one of the weights in our network might have a much greater impact on the cost function than the adjustment to some other weight.
For example, if you have some function with two variables as an input and you compute that its gradient at some particular point comes out as (3,1). You can also infer it as changes to the first variable have three times the importance as changes to the second variable that at least in the neighborhood of the relevant input.
So, when you initialize the network with random weights and biases and try to adjust them many times based on this gradient descent process
How well does is actually perform on the images that it’s never seen before?
With the simple neural network what we have consider above, we observe that the model performance is around 93-95%. But if we play around with the hidden layer structure and try to make the network more complicated with couple of tweaks, we can get the model performance up to 98%. That’s pretty good but not the best.
We can get the better performance by creating more complex network than the simple vanilla network.
That’s how neural networks works and with the help of backpropagation algorithm and gradient descent we get our best model by reducing the error term.
Finally, let me take a small example so that you may get more strong foundation about gradient descent.
Let’s try to predict the price of a new house from the given housing data:
In this example our task is to predict the price of a new house based on the size of the house.
Let’s plot the given historical housing data:
Line Chart
From the plot , we can see it follows a linear trend as the size of the house increases, the price is also increasing.
Next, let’s try to use a simple linear regression model, where we try to fit a line on the given historical data and predict the price of a new house (Ypred) based on its size (X).
Regression line
From the figure, we can see that the red line gives the predicted house price (i.e., Ypred) given house size(X).
So, Ypred can be given as: Ypred = a+bX
Now, the blue line represents the actual prices from the historical data i.e., Yactual.
The difference between Yactual and Ypred which is represented by the yellow dashed lines is called prediction error (error) E.
So now, our aim is to find a line with optimal values of a, b which is known as coefficients/weights that best fits the historical data by reducing the prediction error/ error and improving prediction accuracy.
Here, our goal is to find optimal values of a, b that can minimizes the error between actual and predicted values of house price.
So,  Sum of Squared Errors (SSE) = ½ Sum (Actual House Price – Predicted House Price)2
                                               SSE = ½ Sum (Y – Ypred)2
(NOTE: 1/2 is used for mathematical convenience since it helps us in calculating gradients in calculus easily.
There are other types of measures of Error. Here, we used SSE which is just one of them).
This is where the Gradient Descent comes into the picture. Gradient descent is an optimization algorithm that finds the optimal values of weights a, b that reduces the prediction error.
Let’s now try to understand the Gradient Descent algorithm with an example:
Below are the steps involved in the gradient descent algorithm are mentioned.
Step 1: First, Initialize the weights (a & b) with random values and calculate Error (SSE).
Step 2: Calculate the gradient i.e. change in SSE when the weights (a & b) are changed by a very small value from their original randomly initialized value. This helps us move the values of a & b in the direction in which SSE can be minimized.
Step 3: Now, we adjust the weights with the gradients to reach the optimal values so that SSE can be minimized.
Step 4: Use these new weights for prediction and then calculate the new SSE.
Step 5: Finally, we repeat steps 2 and 3 till further adjustments to weights doesn’t significantly reduce the Error.
Let’s now go through each of the steps in detail.
But before that, we have to standardize the data as it will make the optimization process faster and convenient.
Normal and standardize data
Step 1: To fit a line Ypred = a + b X, we start off with random values of a and b and then calculate the prediction error (SSE) accordingly as shown below.
Calculated SSE
Step 2: In our second step we calculate the error gradient w.r.t the weights as shown below.
∂SSE/∂a = – (Y-Ypred)
∂SSE/∂b = – (Y-Ypred) X
Here, SSE = ½ (Y-Ypred)2 = ½ (Y- (a+bX))2
Here, ∂SSE/∂a and ∂SSE/∂b are the gradients and which give the direction of the movement of a, b w.r.t to SSE.
Derivatives of a and b
Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimum.
Total SSE
Now, we need to update the random values of a, b so that we can move in the direction of optimal a, b where total SEE is minimum.
Update rules:
·        a – ∂SSE/∂a
·        b – ∂SSE/∂b
So, update rules:
1.     New a = a – lr * ∂SSE/∂a = 0.45-0.01*3.300 = 0.42
2.     New b = b – lr * ∂SSE/∂b= 0.75-0.01*1.545 = 0.73
Here, lr is the learning rate = 0.01, which is the pace of adjustment to the weights. I already explained you about learning rate earlier.
Note: These values are randomly taken to explain you the concepts. Don’t consider them as a fixed one.
Step 4: Now, we use these new a and b values for our prediction and then use it to calculate the new Total SSE.
Total SSE with new weights
Now we can observe from the above table, with the new prediction, the total SSE has gone down (0.677 to 0.553). That implies our prediction accuracy has improved and model will show better performance.
Step 5: Finally, in our last step we repeat step 3 and 4 till the time further adjustments to a, b doesn’t show significantly reduce in the error.
At that time, we have arrived at the optimal value of a, b with the highest prediction accuracy.
This is how the Gradient Descent Algorithm works behind the scene. This optimization algorithm and different variants form the core of many machine learning and Deep Learning.
Note: We don’t have to do all this manually. There are many built-in functions available which performs all this for us. Only we have to call them.
I hope, finally you came to know the what is Gradient, how its work and its importance. In next tutorials, I will try to up come with a detailed explanation of some others topics. For more blogs/courses on data science, machine learning, artificial intelligence and new technologies do visit us at InsideAIML.

Submit Review