Implementation of Gradient Ascent using Logistic Regression

Vasugupta
7 min read · Nov 1, 2020


Before understanding gradient ascent, let's first understand what Logistic Regression is.

Logistic Regression

Logistic Regression is a machine learning classification algorithm used in predictive analysis. It is similar to Linear Regression, but the main difference lies in the cost function: Logistic Regression uses a more complex function, namely the log-likelihood cost function, whereas Linear Regression uses mean squared error (MSE). This function is based on the concept of probability, and for a single training input (x, y) the assumption made by the model is

P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

These probability values come from the hypothesis, the sigmoid function, which maps its input to a value between 0 and 1.

What is the Sigmoid Function?

It is used to map predictions to the range (0, 1). The value obtained from the hypothesis is passed into this function, which squashes it between 0 and 1, so the output can be interpreted as a probability.

\sigma(z) = \frac{1}{1 + e^{-z}}

So let's look at the implementation of the sigmoid function.
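(The original snippet appears only as an image. Here is a minimal sketch of what it might look like, assuming the method sits inside a LogisticRegression class; the class name and constructor arguments are my own, except self.eps and the default learning rate of 0.01, which the article mentions later.)

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000, eps=1e-7):
        self.learning_rate = learning_rate    # step size alpha for gradient ascent
        self.num_iterations = num_iterations  # number of ascent steps
        self.eps = eps                        # small value used later to avoid log(0)
        self.weights = None                   # theta, set when fit() is called
        self.likelihoods = []                 # log likelihood per iteration, for plotting

    def sigmoid(self, z):
        # map each element of the numpy array z into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))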

In the source code above, the function takes a single parameter as input, the numpy array z, and returns a numpy array of probability values mapped between 0 and 1. The self in the function just represents the instance of the class; it has nothing to do with the output of the function.

Let’s talk about the cost function

The cost function plays the role of the loss function in linear regression, but in gradient ascent we maximize it rather than minimize it.

In machine learning, cost functions measure how far the predicted value is from the actual value. This quantity is either minimized (gradient descent) or maximized (gradient ascent). In gradient ascent the objective is the log likelihood, and maximizing it is called maximum likelihood estimation. Let's see how we can compute the log likelihood.

Since it is a cost function, we pass the actual values and the predicted values as parameters, both of which are numpy arrays here.

We compute the log likelihood using the following formula:

LL(\theta) = \sum_i \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]

While calculating the gradient (slope) in gradient ascent, which is discussed further below, we take the partial derivative of the above function with respect to θ to find the maximum likelihood.

You might be thinking that the numpy array above may contain the value 0, and log(0) is undefined. To prevent this, we substitute some small minimum value instead of 0.
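(The original code is an image. A sketch of such a method, assuming it is added to the class above; the parameter names y and y_pred are my own.)

def log_likelihood(self, y, y_pred):
    # clamp predicted probabilities into [eps, 1 - eps] so log(0) never occurs
    y_pred = np.clip(y_pred, self.eps, 1 - self.eps)
    return np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))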

So whenever a 0 comes in, it is replaced by the small value assigned to the variable 'self.eps'.

Gradient Ascent

After seeing the cost function and the sigmoid function, let's now think of an algorithm that combines these two functions and gives us the desired result. That algorithm is gradient ascent.

Gradient ascent is an iterative optimization algorithm for finding a local maximum of a differentiable function. The algorithm moves in the direction of the gradient, computed at each point of the cost function curve, until the stopping criterion is met.

Gradient Ascent

Now that we have the log likelihood function, we simply need to choose the values of θ that maximize it. In short, we will take the partial derivative of the function with respect to θ. Here θ is the weight vector, which can be initialized to zeros or to random values between 0 and 1.

So let's differentiate the function.
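(The full derivation appears as an image in the original post. The key steps, using the fact that the derivative of the sigmoid is h(1 − h), are:)

\frac{\partial LL(\theta)}{\partial \theta_j}
  = \sum_i \left( \frac{y_i}{h_\theta(x_i)} - \frac{1 - y_i}{1 - h_\theta(x_i)} \right) \frac{\partial h_\theta(x_i)}{\partial \theta_j}
  = \sum_i \frac{y_i - h_\theta(x_i)}{h_\theta(x_i)\left(1 - h_\theta(x_i)\right)} \, h_\theta(x_i)\left(1 - h_\theta(x_i)\right) x_{ij}
  = \sum_i \left( y_i - h_\theta(x_i) \right) x_{ij}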

The last term we obtain is the partial derivative of the log likelihood:

\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_i \left( y_i - h_\theta(x_i) \right) x_{ij}

Now that we have the partial derivative, our goal is to choose the parameters θ that maximize the likelihood. Here comes gradient ascent. The idea behind gradient ascent is that the gradient points 'uphill', so if you keep moving in the direction of the gradient you eventually reach a maximum (for logistic regression the log likelihood is concave, so this is in fact the global maximum). A common analogy: imagine being left stranded and blindfolded at the bottom of a mountain valley, with the objective of reaching the top of the hill.

Now, to maximize our log likelihood, we need to run the gradient ascent update on each parameter, i.e.

weights = weights + learning_rate*gradient

\theta_j := \theta_j + \alpha \, \frac{\partial LL(\theta)}{\partial \theta_j}

Here α is the learning rate of the model, the size of the step we take uphill. By running this update repeatedly, we reach the optimum θ where the likelihood is maximum.

Let's look at the code for gradient ascent.
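(The original listing is an image. A sketch consistent with the description below, again assuming the class from the sigmoid section; recording self.likelihoods for plotting later is my own addition.)

def fit(self, X, y):
    # initialise the weight vector theta with zeros, one weight per feature
    self.weights = np.zeros(X.shape[1])
    for _ in range(self.num_iterations):
        # hypothesis: predicted probability for every training sample
        y_pred = self.sigmoid(np.dot(X, self.weights))
        # gradient of the log likelihood: X^T (y - y_pred)
        gradient = np.dot(X.T, y - y_pred)
        # gradient ascent update: move uphill along the gradient
        self.weights += self.learning_rate * gradient
        # record the log likelihood so we can plot it against iterations later
        self.likelihoods.append(self.log_likelihood(y, y_pred))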

In the above function we take X (X_train) and y (y_train) as input, both numpy ndarrays. First we initialize the weight vector θ with zeros (or random values between 0 and 1).

Then we evaluate the hypothesis to get the probability values for the input data X, calculate the gradient of the log likelihood (the partial derivative of the function with respect to the weights), and update the weights using the gradient ascent rule. This process is repeated for the number of iterations specified by the user, so we slowly climb toward the maximum.

This is the fitting of the model; now let's evaluate whether it predicts correctly.

For this, let's take a dataset and load it as shown below. I am using the divorce dataset; you can use any other classification dataset.
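(The download step is shown only as an image in the original. A minimal sketch, assuming the UCI 'Divorce Predictors' data has been saved locally as divorce.csv; the file distributed by UCI is semicolon-separated, so adjust the path and separator to your copy.)

import pandas as pd

# load the divorce dataset into a dataframe
df = pd.read_csv('divorce.csv', sep=';')
print(df.shape)  # rows = samples, columns = features + target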

Perform pre-processing: label/one-hot encoding, normalization with a MinMax scaler, adding a bias term, and checking for missing values (imputing them if any are found). The dataset I have chosen is already label-encoded and has a short range of values, so there is no need for normalization here; otherwise, normalization is a must.

Now that we have the data frame, let's split the original data into training and testing sets using the train_test_split function from the sklearn library, as shown below. It is advisable to split the dataset in a ratio of 70:30, 75:25, or 80:20.
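(A sketch of the segregation step described next; the 75:25 split and the random_state are my own choices.)

import numpy as np
from sklearn.model_selection import train_test_split

# all rows and the first n-1 columns -> input features
x = df.iloc[:, :-1].values
# all rows of the last column -> target
y = df.iloc[:, -1].values

# add a bias column of ones so the first weight acts as the intercept
x = np.hstack([np.ones((x.shape[0], 1)), x])

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)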

In this particular code we take all the rows and the first n−1 columns of the dataframe as the input x, and all the rows of the last column as the target y. The dataframe is then converted into numpy arrays.

Now that you have X_train and y_train, first initialize the model and then pass these two into the fit function of the gradient ascent model created above. The default learning rate is 0.01, which is an advisable starting point, though in the initial phases of learning do try playing with the hyperparameters to study their effect on the model.
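(Putting it together, with the class and argument names assumed earlier:)

# initialise the model with the default learning rate and fit it
model = LogisticRegression(learning_rate=0.01, num_iterations=1000)
model.fit(X_train, y_train)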

The model is trained once you run the fit function. Now comes prediction: all the tasks above were performed so that we can make predictions on new samples of data. So let's predict on the testing set that was set aside above, and at the same time create a function to compute the accuracy of the model on the test set.
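(A sketch of such an accuracy function, assuming a predict() method like the one in the full listing at the end of the post:)

def accuracy(y_true, y_pred):
    # percentage of test samples whose predicted class matches the true class
    return np.mean(y_true == y_pred) * 100

y_pred = model.predict(X_test)
print(accuracy(y_test, y_pred))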

After all this hard work of understanding the algorithm and writing the code, we finally get an accuracy of 88.46153846153846 %.

We may even plot the graph of log likelihood vs. the number of iterations, and it results in the smooth curve we expect.
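(A sketch of the plot, assuming fit() recorded the log likelihood of each iteration in model.likelihoods as in the code above:)

import matplotlib.pyplot as plt

plt.plot(model.likelihoods)
plt.xlabel('No. of Iterations')
plt.ylabel('Log Likelihood')
plt.title('Log Likelihood vs No. of Iterations')
plt.show()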

Log Likelihood Curve

To sum up, here is the whole code of the model for your understanding. Try the implementation with different types of classification datasets, and do let me know in the comments if you face any difficulty.
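(A consolidated sketch of the whole model under the assumptions made throughout this post; the class and method names are mine except where the article names them, e.g. self.eps, predict_proba, and the 0.5 threshold.)

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000, eps=1e-7):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.eps = eps                # avoids log(0) in the log likelihood
        self.weights = None
        self.likelihoods = []

    def sigmoid(self, z):
        # map values into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def log_likelihood(self, y, y_pred):
        y_pred = np.clip(y_pred, self.eps, 1 - self.eps)
        return np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

    def fit(self, X, y):
        self.weights = np.zeros(X.shape[1])
        for _ in range(self.num_iterations):
            y_pred = self.sigmoid(np.dot(X, self.weights))
            gradient = np.dot(X.T, y - y_pred)        # X^T (y - h(x))
            self.weights += self.learning_rate * gradient
            self.likelihoods.append(self.log_likelihood(y, y_pred))

    def predict_proba(self, X):
        # probability that each sample belongs to class 1
        return self.sigmoid(np.dot(X, self.weights))

    def predict(self, X, threshold=0.5):
        # scale probabilities >= threshold up to 1, otherwise down to 0
        return (self.predict_proba(X) >= threshold).astype(int)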

In the predict function above, I passed a numpy array as input, which is the list of samples whose values we need to classify. The values of the array are first passed to the predict_proba() function, which returns the probability values. These values are then converted into 1 or 0 depending on the threshold we set. Here the threshold is 0.5: if the value returned from predict_proba is greater than or equal to 0.5, it is scaled up to 1; otherwise it is scaled down to 0.

Conclusion

In this blog, I have presented the basic concept of the gradient ascent algorithm with an example. I hope you have understood its implementation and feel motivated towards Artificial Intelligence and Machine Learning.

Always remember that Artificial Intelligence is the New Electricity and my friend you are the lineman producing and controlling it.
