The classification problem is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1.
Logistic Regression Model
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
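The hypothesis behind this boundary is h_theta(x) = g(theta^T x), where g is the sigmoid (logistic) function; we predict y = 1 whenever h_theta(x) >= 0.5, i.e. whenever theta^T x >= 0. A minimal sketch in plain Python (the parameter values are illustrative, not fitted to any data):

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# With theta = [-3, 1, 1], the decision boundary theta^T x = 0
# is the line x1 + x2 = 3 (x[0] = 1 is the intercept term).
theta = [-3.0, 1.0, 1.0]
print(hypothesis(theta, [1.0, 2.0, 2.0]))  # x1 + x2 > 3, so h > 0.5 -> predict y = 1
print(hypothesis(theta, [1.0, 1.0, 1.0]))  # x1 + x2 < 3, so h < 0.5 -> predict y = 0
```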
Logistic Regression Cost Function
We cannot reuse the cost function from linear regression, because passing the logistic function through the squared-error cost makes the output wavy, with many local optima. In other words, it will not be a convex function.
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0; if our hypothesis approaches 1, then the cost function will approach infinity. Symmetrically, if 'y' is 1, the cost is 0 when the hypothesis outputs 1 and approaches infinity as the hypothesis approaches 0.
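This behaviour comes from using -log(h) as the cost when y = 1 and -log(1 - h) when y = 0. A small sketch that demonstrates the blow-up:

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# With y = 0, the cost is near 0 when the hypothesis outputs (almost) 0 ...
print(cost(1e-12, 0))  # ~0
# ... and grows without bound as the hypothesis approaches 1.
for h in (0.9, 0.99, 0.999999):
    print(cost(h, 0))
```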
Simplified Cost Function and Gradient Descent
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized.
One-vs-all = N one-vs-rest classifiers, one for each of the N classes.
One-vs-rest = an ordinary (binary) logistic regression classifier that separates one class from all the others; to classify a new input, pick the class whose classifier outputs the highest probability.
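The scheme above can be sketched generically. This is a sketch only: `fit_logistic(X, y)` (returns a fitted binary model) and `hypothesis(model, x)` (returns P(y = 1 | x)) are hypothetical helpers standing in for the binary logistic regression pieces discussed earlier.

```python
# One-vs-all: for N classes, train N one-vs-rest logistic regression
# classifiers, then predict the class whose classifier is most confident.

def one_vs_all_train(X, y, classes, fit_logistic):
    """Train one binary classifier per class: that class (1) vs. the rest (0)."""
    return {c: fit_logistic(X, [1 if label == c else 0 for label in y])
            for c in classes}

def one_vs_all_predict(models, x, hypothesis):
    """Pick the class whose one-vs-rest classifier outputs the highest probability."""
    return max(models, key=lambda c: hypothesis(models[c], x))
```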
The Problem of Overfitting
- Reduce the number of features:
  - Manually select which features to keep.
  - Use a model selection algorithm.
- Regularization:
  - Keep all the features, but reduce the magnitude of the parameters theta-j.
  - Regularization works well when we have a lot of slightly useful features.
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
Cost Function with Lambda
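The regularized cost adds (lambda / 2m) times the sum of the squared parameters to the logistic cost, conventionally skipping the intercept theta-0. A minimal sketch in plain Python:

```python
import math

def regularized_cost(theta, X, y, lam):
    """J(theta) = -(1/m) sum[y log h + (1-y) log(1-h)] + (lam/2m) sum_{j>=1} theta_j^2."""
    m = len(y)
    def h(x):
        return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    data_term = -sum(yi * math.log(h(xi)) + (1 - yi) * math.log(1 - h(xi))
                     for xi, yi in zip(X, y)) / m
    # theta[0] is the intercept and is not regularized by convention.
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return data_term + penalty
```

A larger lambda inflates the penalty on the theta parameters, pushing them toward zero; lambda = 0 recovers the unregularized cost.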
Gradient Descent with Lambda
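With regularization, each step shrinks theta-j (j >= 1) by a factor of (1 - alpha * lambda / m) before the usual gradient update, while theta-0 is updated without shrinkage. One step sketched in plain Python:

```python
import math

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update:
       theta_0 := theta_0 - alpha * (1/m) * sum (h - y) * x_0
       theta_j := theta_j * (1 - alpha*lam/m) - alpha * (1/m) * sum (h - y) * x_j   (j >= 1)
    """
    m = len(y)
    def h(x):
        return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
    errors = [h(xi) - yi for xi, yi in zip(X, y)]
    new_theta = []
    for j, tj in enumerate(theta):
        grad = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        shrink = 0.0 if j == 0 else alpha * lam / m * tj  # no shrinkage on the intercept
        new_theta.append(tj - alpha * grad - shrink)
    return new_theta
```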