## Logistic regression with gradient descent algorithm


First, the hypothesis function, which models the relationship between the dependent variable $P(y = 1)$ and the independent variables $x$: \begin{align*} H_i = h(x_i) &= \frac{1}{1 + e^{-x_i \cdot \beta}} \end{align*} where $x_i$ is the $i$th row of the design matrix $\underset{m \times n}{X}$. Or, in matrix form: \begin{align*} H &= \frac{1}{1 + e^{-X \beta}} \end{align*} $H$ is an $m \times 1$ matrix; except for the product $X\beta$, all operations are element-wise.
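As a quick numeric illustration (a pure-Python sketch, separate from the post's own R and Sage code; the function name `hypothesis` is just for this example), the hypothesis squashes any linear combination $x_i \cdot \beta$ into $(0, 1)$, so it can be read as a probability:

```python
import math

def hypothesis(x_row, beta):
    # h(x) = 1 / (1 + exp(-x . beta)), always strictly between 0 and 1
    z = sum(xj * bj for xj, bj in zip(x_row, beta))
    return 1.0 / (1.0 + math.exp(-z))

print(hypothesis([1.0, 0.0], [0.0, 3.0]))   # z = 0 gives exactly 0.5
print(hypothesis([1.0, -5.0], [0.7, 3.0]))  # very negative z, h close to 0
print(hypothesis([1.0, 5.0], [0.7, 3.0]))   # very positive z, h close to 1
```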
The cost function $J$ measures the deviance of the modeled dependent variable from the observed $y$. Notice how the cost varies with $h$ when $y = 0$ and $y = 1$, respectively; the figure can be plotted with Sage:
```python
h = var("h")
p1 = plot(-log(h), (h, 0, 1),
          axes_labels=['$h_i$', '$J_i$'],
          legend_label=r"$-\log h$",
          color="blue")
p2 = plot(-log(1-h), (h, 0, 1),
          axes_labels=['$h_i$', '$J_i$'],
          legend_label=r"$-\log (1-h)$",
          color="red")
show(p1 + p2, figsize=[4, 4])
```
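Numerically (again a pure-Python illustration, with helper names made up for this example), the two branches of the cost penalize confident wrong predictions heavily and confident correct ones hardly at all:

```python
import math

# Cost of a single prediction h under each observed label:
# y = 1 pays -log(h); y = 0 pays -log(1 - h).
def cost_when_1(h):
    return -math.log(h)

def cost_when_0(h):
    return -math.log(1 - h)

print(cost_when_1(0.99))  # near 0: confident and correct
print(cost_when_1(0.01))  # large: confident and wrong
print(cost_when_0(0.01))  # near 0: confident and correct
print(cost_when_0(0.99))  # large: confident and wrong
```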

\begin{align*} J_i &= \begin{cases} -\log H_i, & y_i = 1 \\ -\log (1-H_i), & y_i = 0 \end{cases} \\ &= -y_i\log H_i - (1-y_i)\log (1-H_i) \end{align*}

Here $J_i$ is the $i$th addend of $J$, and $H_i$ is the $i$th element of $H$.

\begin{align*} J &= \frac{1}{m} \sum_{i = 1}^m J_i \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ -y_i\log H_i - (1-y_i)\log (1-H_i) \right] \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ -y_i \log \frac{1}{1+e^{- x_i \cdot \beta}} - (1 - y_i) \log \left( 1 - \frac{1}{1+e^{- x_i \cdot \beta}} \right) \right] \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ y_i \log \left(1+e^{-x_i \cdot \beta}\right) + (1 - y_i) \log \left( 1+e^{x_i \cdot \beta} \right) \right] \end{align*}

The goal of the game is to find the value of $\beta$ that minimizes $J$, so we need the derivative of $J$ with respect to $\beta$; the result is an $n \times 1$ matrix.

\begin{align*} \frac{\partial J}{\partial \beta_j} &= \frac{1}{m} \sum_{i=1}^m \left[ y_i H_i e^{-x_i \cdot \beta} \left( -X_{ij} \right) + \left( 1-y_i \right) \frac{1}{1+e^{x_i \cdot \beta}} e^{x_i \cdot \beta} X_{ij} \right] \\ &= \frac{1}{m} \sum_{i=1}^m \left[ y_i H_i e^{-x_i \cdot \beta} \left( -X_{ij} \right) + \left( 1-y_i \right) H_i X_{ij} \right] \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( -y_i e^{-x_i \cdot \beta} + 1 - y_i \right) \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( 1 - y_i \left( 1+e^{-x_i \cdot \beta} \right) \right) \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( 1 - y_i / H_i \right) \\ &= \frac{1}{m} \sum_{i=1}^m \left( H_i - y_i \right) X_{ij} \\ &= \frac{1}{m} \, (H - y) \cdot X_j \\\\ \frac{\partial J}{\partial \beta} &= \frac{1}{m} X^T (H-y) \end{align*}

This all looks very complicated, but $J$ is just a function of $\beta$, and as in a typical optimization problem, we need to find the $\beta$ that makes $\frac{\partial J}{\partial \beta}$ equal to zero. To get there we use the gradient descent algorithm, which can be summarized as follows:
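The closed-form gradient $\frac{1}{m} X^T (H - y)$ is easy to sanity-check numerically. A pure-Python sketch (the toy data and the name `cost_and_grad` are made up for this example) compares the analytic gradient against a central finite difference of $J$:

```python
import math

def cost_and_grad(X, y, beta):
    """Average cross-entropy cost J and its gradient (1/m) X^T (H - y)."""
    m, n = len(X), len(X[0])
    H = [1.0 / (1.0 + math.exp(-sum(X[i][j] * beta[j] for j in range(n))))
         for i in range(m)]
    J = -sum(y[i] * math.log(H[i]) + (1 - y[i]) * math.log(1 - H[i])
             for i in range(m)) / m
    grad = [sum((H[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j in range(n)]
    return J, grad

# Toy data (made up): an intercept column plus one feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
beta = [0.1, -0.3]
J, grad = cost_and_grad(X, y, beta)

# A central finite difference of J should agree with the analytic gradient.
eps = 1e-6
for j in range(len(beta)):
    bp, bm = beta[:], beta[:]
    bp[j] += eps
    bm[j] -= eps
    numeric = (cost_and_grad(X, y, bp)[0] - cost_and_grad(X, y, bm)[0]) / (2 * eps)
    assert abs(numeric - grad[j]) < 1e-8
```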

Until convergence: $\beta := \beta - \alpha \frac{\partial J}{\partial \beta}$

This looks strange at first glance, but it actually makes sense. $\alpha$ here is called the learning rate, and we want to reach the point where the slope is zero. When the slope is large, we are far from that point, so we can move in bigger steps, i.e. $\alpha \frac{\partial J}{\partial \beta}$ is big. As we approach the goal, the slope nears zero, so we must be careful and take smaller steps, i.e. $\alpha \frac{\partial J}{\partial \beta}$ is small. All of this is taken care of automatically by $\frac{\partial J}{\partial \beta}$, and as long as $\alpha$ is not too big, $\beta$ will converge (we can use step halving to guard against an over-sized $\alpha$). The result of this iterative assignment is the optimal $\beta$.
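The update rule with step halving can be sketched in a few lines of Python on made-up toy data (the function name `fit_logistic`, the tolerance, and the iteration cap are illustrative choices, not part of the original):

```python
import math

def fit_logistic(X, y, alpha=0.25, tol=1e-5, max_iter=5000):
    """Gradient descent with step halving: beta := beta - alpha * dJ/dbeta.
    X must already contain an intercept column."""
    m, n = len(X), len(X[0])

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def cost(b):
        total = 0.0
        for i in range(m):
            h = sigmoid(sum(X[i][j] * b[j] for j in range(n)))
            total -= y[i] * math.log(h) + (1 - y[i]) * math.log(1 - h)
        return total / m

    def grad(b):
        g = [0.0] * n
        for i in range(m):
            h = sigmoid(sum(X[i][j] * b[j] for j in range(n)))
            for j in range(n):
                g[j] += (h - y[i]) * X[i][j] / m
        return g

    beta = [0.0] * n
    J = cost(beta)
    for _ in range(max_iter):
        g = grad(beta)
        new_beta = [b - alpha * gj for b, gj in zip(beta, g)]
        new_J = cost(new_beta)
        # Step halving: if the cost went up, the step was too big.
        while new_J > J:
            alpha /= 2
            new_beta = [b - alpha * gj for b, gj in zip(beta, g)]
            new_J = cost(new_beta)
        # Stop once the update barely moves beta.
        if max(abs(nb - b) for nb, b in zip(new_beta, beta)) < tol:
            return new_beta
        beta, J = new_beta, new_J
    return beta

# Toy separable data (made up): y switches from 0 to 1 as the feature grows,
# so the fitted slope beta[1] should come out positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
beta = fit_logistic(X, y)
```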

Finally, an implementation of this algorithm in R:
```r
logreg = function(y, x) {
    alpha = .25
    x = as.matrix(x)
    # x = apply(x, 2, scale)  # optionally standardize the features
    x = cbind(1, x)           # add the intercept column
    m = nrow(x)
    n = ncol(x)

    b = matrix(rnorm(n))
    # b = matrix(rep(0, n))   # alternative: start from zero
    h = 1 / (1 + exp(-x %*% b))

    # the constant 1/m factor is omitted; it is absorbed into alpha
    J = -(t(y) %*% log(h) + t(1-y) %*% log(1-h))
    derivJ = t(x) %*% (h-y)

    niter = 0
    while(1) {
        niter = niter + 1
        newb = b - alpha * derivJ
        # cat("newb........\n"); print(newb)  # uncomment to trace the iterations
        h = 1 / (1 + exp(-x %*% newb))
        newJ = -(t(y) %*% log(h) + t(1-y) %*% log(1-h))

        # step halving:
        # if the cost function J is increasing,
        # or the absolute value of b is too big (leading to aberrant log h),
        # then halve the learning rate alpha
        while((J - newJ < 0) || max(abs(newb)) > 5) {
            alpha = alpha / 2
            newb = b - alpha * derivJ
            h = 1 / (1 + exp(-x %*% newb))
            newJ = -(t(y) %*% log(h) + t(1-y) %*% log(1-h))
        }
        # at convergence, stop iterating
        if(max(abs(b - newb)) < 0.00001) {
            break
        }
        b = newb
        J = newJ
        derivJ = t(x) %*% (h-y)
    }
    print(b)
}
```

```r
testglm = function() {
    nr = 5000
    nc = 50
    x = matrix(rnorm(nr*nc), nr)
    y = matrix(sample(0:1, nr, repl=T), nr)
    res = apply(x, 2, function(coli) summary(glm(y~coli, family=binomial))$coef[2, ])
    # print(res)
}

testlogreg = function() {
    nr = 500
    nc = 50
    x = matrix(rnorm(nr*nc), nr)
    y = matrix(sample(0:1, nr, repl=T), nr)
    res = apply(x, 2, function(coli) logreg(y, coli))
}

system.time(testlogreg())
system.time(testglm())
```


Here is the link to the derivation of the cost function via maximum likelihood:
A lot of the content here is adapted from Andrew Ng's amazing course here: https://class.coursera.org/ml-005/lecture