Logistic Regression - Introduction and Cost Function
INTRODUCTION TO LOGISTIC REGRESSION
Difference between Linear & Logistic Regression
Fig 1 : Linear vs. Logistic Regression
In simple words, as seen in FIG 1, linear regression is used when our output/predicted variable is continuous, while logistic regression is used when the output we need is binary.
Fig 2 : Plot of Linear vs. Logistic Regression
As seen in FIG 2, the plot on the left shows continuous data, which is where linear regression works best. The plot on the right shows data points grouped into two classes; here logistic regression is the better choice, since the outcome we need is binary (1/0, yes/no, true/false).
What is a Sigmoid Function?
The logistic regression algorithm also uses a linear equation with independent predictors to predict a value. The predicted value can be anywhere between negative infinity and positive infinity, but we need the output of the algorithm to be a class variable, i.e. 0 - no, 1 - yes. Therefore, we squash the output of the linear equation into the range [0, 1]. To squash the predicted value between 0 and 1, we use the Sigmoid function.
$g(z)=\frac{1}{1+{{e}^{-z}}}$
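As a quick illustration, here is a minimal NumPy sketch of the sigmoid (the function name and the sample inputs are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(-10))  # ~0.000045
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.999955
```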
We already know that the equation of linear regression is
${{h}_{\theta }}(x)={{\theta }_{0}}+{{\theta }_{1}}{{x}_{1}}+{{\theta }_{2}}{{x}_{2}}+...+{{\theta }_{n}}{{x}_{n}}$
${{h}_{\theta }}(x)={{\theta }^{T}}x$
Where,
${{\theta }_{0}},{{\theta }_{1}}...{{\theta }_{n}}$ - Coefficients or Parameters
Inserting the hypothesis of linear regression into the Sigmoid function, we get
${{h}_{\theta }}(x)=\frac{1}{1+{{e}^{-{{\theta }^{T}}x}}}$
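Putting the two pieces together, a minimal sketch of this hypothesis could look like the following (the parameter values and the bias-term convention $x_0 = 1$ are assumptions made purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x), where x includes the bias term x0 = 1."""
    return sigmoid(np.dot(theta, x))

# Hypothetical parameters and a single example (bias term prepended)
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 1.5])      # x0 = 1 (bias), x1 = 1.5
print(hypothesis(theta, x))   # sigmoid(-1 + 2*1.5) = sigmoid(2) ≈ 0.88
```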
Let's take a single-feature problem, say deciding whether an animal is a dog or a cat based on its picture. Suppose that for a particular test image the model says there is a 90% probability that the animal is a dog.
${{h}_{\theta }}(x)={{\theta }_{0}}+{{\theta }_{1}}{{x}_{1}}$
$h_\theta(x) = 0.9$
We can now say that for this specific test image, the probability of $y=1$ (that the image is that of a dog) is 0.9.
${{h}_{\theta }}(x)=P(y=1|x;\theta )=0.9$
The above equation reads as the probability of $y=1$ given a value of $x$ (in our case, an image of an animal), parameterized by $\theta$. Since the probability of the image being a dog is 0.9, the probability of it being a cat is 0.1.
Based on this we can write,
$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$
$0.1+0.9=1$
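In code, this complementary relationship is just a one-line check (the 0.9 is the made-up probability from the example above):

```python
p_dog = 0.9            # h_theta(x) = P(y = 1 | x; theta) from the example
p_cat = 1.0 - p_dog    # P(y = 0 | x; theta)
print(p_dog + p_cat)   # 1.0
```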
Cost Function
We know the cost function (Mean Squared Error) of linear regression is as follows.
$J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
where,
${{h}_{\theta }}({{x}^{(i)}})\to$ Predicted value of the outcome
${{y}^{(i)}}\to$ Actual value of the outcome
So, to recap, the cost function we used in linear regression is
$J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
which can be rewritten in a slightly different way,
$J(\vec{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$
Defining the per-example cost term for linear regression as
${Cost}(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$
$J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})$
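As a small sanity check, here is a sketch (with made-up predictions and labels) showing that the decomposed form gives the same value as the original $\frac{1}{2m}$ formula:

```python
import numpy as np

def cost_term(h_x, y):
    """Per-example squared-error cost: (1/2) * (h(x) - y)^2."""
    return 0.5 * (h_x - y) ** 2

def J(h, y):
    """J(theta) = (1/m) * sum of per-example costs."""
    return np.mean([cost_term(h_i, y_i) for h_i, y_i in zip(h, y)])

# Hypothetical predictions and labels, just to show the two forms agree
h = np.array([0.2, 0.7, 0.9])
y = np.array([0.0, 1.0, 1.0])
print(J(h, y))                              # decomposed form
print(np.sum((h - y) ** 2) / (2 * len(y)))  # original 1/(2m) form, same value
```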
The Linear Regression cost function does not work here!
The above cost function of linear regression cannot be reused directly. In linear regression we can minimize the mean squared error with any optimization algorithm because the cost function is convex: it has a single global minimum, as seen in FIG 4.
FIG 4 : Cost Function in Linear Regression
If you try to use the linear regression cost function in a logistic regression problem (with the sigmoid hypothesis plugged in), you end up with a non-convex function: a weirdly-shaped surface with no easy-to-find global minimum, as seen in FIG 5.
FIG 5 : Cost Function of Linear Regression cannot be used
Cost Function for Logistic Regression
For logistic regression, the per-example cost is defined separately for the two labels:
$\mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x))$ if $y = 1$
$\mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x))$ if $y = 0$
In the case $y=1$, the output (i.e. the cost to pay) approaches 0 as $h_\theta(x)$ approaches 1. Conversely, the cost grows to infinity as $h_\theta(x)$ approaches 0. You can clearly see this in FIG 6. This is a desirable property: we want a bigger penalty when the algorithm predicts something far away from the actual value. If the label is $y=1$ but the algorithm predicts $h_\theta(x)=0$, the outcome is completely wrong.
The same intuition applies when $y=0$, depicted in FIG 7: the penalty is bigger when the label is $y=0$ but the algorithm predicts $h_\theta(x)=1$.
We can rewrite the above two cases in a single line as
${Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1-h_\theta(x))$
You can verify this by plugging in $y = 1$ and $y = 0$; each case reduces to the corresponding piece of the original cost function.
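A quick numerical check of that claim, with made-up prediction values, could look like this:

```python
import numpy as np

def cost_piecewise(h_x, y):
    """Per-example logistic cost written as two cases."""
    return -np.log(h_x) if y == 1 else -np.log(1.0 - h_x)

def cost_one_liner(h_x, y):
    """The same cost written as a single expression."""
    return -y * np.log(h_x) - (1 - y) * np.log(1.0 - h_x)

# Both forms give identical values for every (prediction, label) pair
for h_x, y in [(0.9, 1), (0.9, 0), (0.1, 1), (0.1, 0)]:
    print(cost_piecewise(h_x, y), cost_one_liner(h_x, y))
```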
Finally, the cost function for logistic regression can be written as follows:
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
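A minimal vectorized sketch of this cost, assuming a design matrix X with a leading column of ones for the bias term and made-up toy data, could look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: 3 examples, one feature plus the bias column
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([-1.0, 1.0])
print(logistic_cost(theta, X, y))  # ≈ 0.36 for these made-up values
```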
Now, by using Gradient Descent with the above cost function, we can find the best values for the parameters.
Soon to come....
Footer Note:
Finally, this has been my first blog ever, but the learning has been amazing. I learnt a bit of HTML, and the struggle to input the math equations into the blog took a while; I had to browse through a lot of blogs and websites to get the concepts right! Apologies for all the mistakes, and credits to all the sources. I shall list them soon.