Logistic Regression: Machine Learning

Shubham Singh
7 min read · Feb 13, 2021

Logistic regression is a statistical model that predicts categorical outcomes, for instance “success” or “failure”, “yes” or “no”, “win” or “loss”. In most cases, logistic regression is used to solve binary classification problems, but it can also be extended to multi-class classification.

Logistic Function

The logistic function is an “S”-shaped curve that maps predicted values to probabilities. The target variable y is the probability that a given input x belongs to a certain class. Since a probability lies in the range 0 to 1, the value of y must also lie between 0 and 1, and this is exactly what the logistic function does: it squashes any real value into the interval (0, 1).

σ(z) = 1 / (1 + e^(−z))

where e is Euler's number (Euler's constant), whose value is approximately 2.71828.
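To make this concrete, here is a minimal NumPy sketch of the logistic function (the name `sigmoid` and the sample inputs are mine, not from the article):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, the midpoint of the S-curve
print(sigmoid(-10))  # close to 0
print(sigmoid(10))   # close to 1
```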

Hypothesis Function Of Logistic Regression

The hypothesis function of logistic regression is very similar to that of linear regression; the only difference is that the linear output is passed through the sigmoid, so the prediction is a probability in (0, 1) rather than an unbounded numeric value:

hθ(x) = g(θᵀx)

where g is the sigmoid or logistic function.

There is another way to write the hypothesis function, by combining both formulas into one:

hθ(x) = 1 / (1 + e^(−θᵀx))
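A minimal sketch of the hypothesis function, assuming a parameter vector `theta` and a feature vector `x` (both names are illustrative, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): a probability in (0, 1)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -1.2, 0.8])  # example parameters
x = np.array([1.0, 2.0, 3.0])       # example features (first entry = bias term)
print(hypothesis(theta, x))         # probability that x belongs to class 1
```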

Decision Boundary

The decision boundary helps us understand when the model will predict 1 and when it will predict 0. The sigmoid activation function returns a probability score between 0 and 1, which we can map to a discrete class (‘no’, ‘yes’). In order to draw a decision boundary, we have to choose a threshold value. For instance, if the threshold value is 0.5, then values equal to or above 0.5 will be classified as 1, or in this case ‘yes’. Conversely, all values below 0.5 will be classified as 0, or ‘no’.

[Figures: a simple (linear) decision boundary and a more complex decision boundary]
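To make the thresholding described above concrete, here is a small sketch (the 0.5 threshold follows the example in the text; the variable names are mine):

```python
import numpy as np

probabilities = np.array([0.12, 0.47, 0.50, 0.86])  # sigmoid outputs
threshold = 0.5

# Values >= 0.5 map to class 1 ('yes'), values below map to class 0 ('no')
predictions = (probabilities >= threshold).astype(int)
print(predictions)  # [0 0 1 1]
```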

Cost Function

We cannot use the MSE (mean squared error) function in logistic regression as we do in linear regression. The simplest explanation is that our prediction function, the sigmoid, is nonlinear, so the squared-error cost would be non-convex with several local minima; if you run gradient descent on such a function, the algorithm is not guaranteed to converge to the global minimum. We need a convex cost function so that gradient descent can converge to the global minimum. The cost function used in logistic regression is cross-entropy, also known as the log loss function:

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

Or we can write the above equations by combining them into a single equation:

Cost(hθ(x), y) = −y · log(hθ(x)) − (1 − y) · log(1 − hθ(x))

There are only two possible results in a binary classification problem, y = 0 or y = 1.

Case 1: if y = 1, the second part of the cost function will be zero. Then the equation will be:

Cost(hθ(x), y) = −log(hθ(x))

Case 2: if y = 0, the first part of the equation will be zero. Then the equation will be:

Cost(hθ(x), y) = −log(1 − hθ(x))

Finally, averaging over all m training examples, our cost function is:

J(θ) = −(1/m) · Σᵢ [ y⁽ⁱ⁾ · log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) · log(1 − hθ(x⁽ⁱ⁾)) ]
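A minimal NumPy sketch of this cross-entropy cost (the small epsilon clip is my addition to avoid log(0); all names are illustrative):

```python
import numpy as np

def cross_entropy_cost(h, y, eps=1e-12):
    """J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = np.clip(h, eps, 1 - eps)  # guard against log(0)
    m = y.shape[0]
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0, 1, 1])          # true labels
h = np.array([0.9, 0.1, 0.8, 0.6])  # predicted probabilities
print(cross_entropy_cost(h, y))     # low cost: predictions match labels well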

Gradient Descent

Gradient descent is an optimization algorithm used to minimize the cost function. For logistic regression, the gradient descent update rule has exactly the same form as for linear regression:

θⱼ := θⱼ − α · (1/m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

the only difference being that the hypothesis hθ(x) is now the sigmoid of θᵀx rather than θᵀx itself.
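A sketch of batch gradient descent for logistic regression, assuming a design matrix `X` with a leading column of ones for the bias term (all names and the toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """theta_j := theta_j - alpha * (1/m) * sum((h(x) - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)          # predictions for all m examples
        gradient = (X.T @ (h - y)) / m  # same form as in linear regression
        theta -= alpha * gradient
    return theta

# Tiny toy data set: a bias column of ones plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0, 0, 1, 1])
print(gradient_descent(X, y))
```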

Multinomial Logistic regression

Logistic regression is mainly effective in binary classification, but for multi-class classification we can use the “one-vs-all” (one-vs-rest) approach. In this method, the multi-class problem is divided into several binary sub-problems, and a logistic regression model is fitted to each sub-problem.

One vs all (one vs rest) classification

Let’s assume you want to classify whether a fruit is an apple, a banana, or a mango. This is a multi-class classification problem.

In the one-vs-all classification method, we treat the data set in this way: apples belong to the first class, bananas and mangoes belong to the second class, and we train the first classifier. Next, bananas belong to the first class and the rest of the fruits belong to the second class, and we train the second classifier; we follow the same procedure to train the third classifier. In this way, we train three classifiers in total: the first learns to recognize apples, the second learns to recognize bananas, and the third learns to recognize mangoes.
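A sketch of one-vs-all classification using scikit-learn, assuming it is installed (the iris data set, with its three classes, stands in for the fruit example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes, so 3 binary classifiers

# OneVsRestClassifier trains one logistic regression per class
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(model.predict(X[:5]))        # predicted class labels
```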

Softmax Function

The softmax function converts a vector of numbers into a vector of probabilities, where the probability of each value is proportional to its relative scale within the vector. It is the generalization of the logistic function to multiple classes:

softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

Before applying the softmax function, the vector components may be less than 0 or greater than 1, and they need not sum to 1. After applying it, every component lies in the interval (0, 1), and all the components together sum to 1. The softmax function can be used for multi-class classification in logistic regression, artificial neural networks, or the Naive Bayes classifier.
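A minimal NumPy sketch of softmax (subtracting the maximum is a standard numerical-stability trick, not something from the article):

```python
import numpy as np

def softmax(z):
    """Map a vector of real scores to a probability vector that sums to 1."""
    shifted = z - np.max(z)  # numerical stability: avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```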

Evaluation of classification problem

After creating the model, the next step is to test its accuracy. In regression models, we measure the distance between true and predicted values with metrics such as the R-squared score or mean squared error. In classification problems, accuracy and related metrics are generally derived from the confusion matrix.

Confusion matrix

For binary classification, the confusion matrix looks like this:

                    Predicted positive     Predicted negative
Actual positive     True Positive (TP)     False Negative (FN)
Actual negative     False Positive (FP)    True Negative (TN)

True Positive: A result that was predicted as positive by the model and actually is positive.

True Negative: A result that was predicted as negative by the model and actually is negative.

False Positive: A result that was predicted as positive by the model but actually is negative.

False Negative: A result that was predicted as negative by the model but actually is positive.
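As a small sketch, the four counts can be computed directly from arrays of true and predicted labels (the sample labels and names are mine):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, actually negative
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive
print(tp, tn, fp, fn)  # 3 3 1 1
```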

Accuracy of the model

The accuracy of the model is the ratio of correct predictions (true positives and true negatives) to all predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall or sensitivity of the model

Recall, or sensitivity, shows how many of the total actual positives are correctly classified by the model. It therefore shows how complete the model is with respect to positive results:

Recall = TP / (TP + FN)

Precision

Precision shows how many of the results the model predicted as positive are actually positive:

Precision = TP / (TP + FP)

Precision-Recall tradeoff

The precision-recall curve shows the trade-off between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. High precision is related to a low false-positive rate, and high recall is related to a low false-negative rate. High scores for both show that the classifier is returning accurate results (high precision) as well as returning a majority of all positive results (high recall).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall returns many results, with all of them labeled correctly.
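If scikit-learn is available, the trade-off can be inspected directly from predicted probabilities (a sketch with made-up scores, not the article's code):

```python
from sklearn.metrics import precision_recall_curve

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_scores = [0.9, 0.4, 0.35, 0.8, 0.6, 0.7, 0.2, 0.1]

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```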

F1 score

From the discussion above, it is clear that precision and recall are both important for model evaluation. To balance the two, another metric, the F1 score, is used. It is the harmonic mean of precision and recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Specificity or true negative rate

Specificity, or the true negative rate, is the ratio of correctly predicted negatives to the total number of actual negatives:

Specificity = TN / (TN + FP)
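Putting the formulas from the last few subsections together in one sketch (the counts are reused from the confusion-matrix example above; all names are mine):

```python
tp, tn, fp, fn = 3, 3, 1, 1  # counts from the confusion-matrix example above

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # (TP + TN) / all predictions
recall      = tp / (tp + fn)                   # sensitivity, true positive rate
precision   = tp / (tp + fp)                   # correctness of positive predictions
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean
specificity = tn / (tn + fp)                   # true negative rate

print(accuracy, recall, precision, f1, specificity)  # 0.75 0.75 0.75 0.75 0.75
```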
