Logistic regression with gradient descent algorithm

First the hypothesis function, which models the relationshiop between the dependent variable $P(y = 1)$ and the independent variable $x$: \begin{align*} H_i = h(x_i) &= \frac{1}{1 + e^{-x_i \cdot \beta}} \end{align*} where $x_i$ is the $i$th row of the design matrix $\underset{m \times n}{X}$, or in matrix form: \begin{align*} H &= \frac{1}{1 + e^{-X \beta}} \end{align*} H is a $m\times 1$ matrix. Except for $X\beta$ all operations are element-wise.

The cost function $J$ is a measure of deviance of the modeled dependent variable from the observed $y$ (Notice how the cost varies with $h$ when $y = 0$ and $y = 1$, respectively):

Plot the above figure with Octave:

h = var("h") p1 = plot(-log(h), (h, 0, 1), axes_labels=['$h_i$', '$J_i$'], legend_label="$-\log h$", color="blue" ) p2 = plot(-log(1-h), (h, 0, 1), axes_labels=['$h_i$', '$J_i$'], legend_label="$-\log (1-h)$", color="red" ) show(p1+p2, figsize=[4, 4])

Until convergence: \[ \beta := \beta - \alpha \frac{\partial J}{\partial \beta} \] This looks strange at first glance, but it actually makes sense. $\alpha$ here is called the learning rate. As shown in the figure below, we want to reach the point where the slope is zero.

When the slope is big, it is far from zero, so we can move at bigger steps, i.e. $\alpha \frac{\partial J}{\partial \beta}$ is big, and as we approaches the goal, the slope is near zero, and we must be careful and take smaller steps, i.e. $\alpha \frac{\partial J}{\partial \beta}$ is small, these are all taken care of automatically by $\frac{\partial J}{\partial \beta}$, and as long as $\alpha$ is not too big, $\beta $ will converge (we can use step halving to avoid over-sized $\alpha$). The result of this iterative assignment is the optimal $\beta$.

Finally, an implementation of this algorithm in R:

Batch gradient descent in R

logreg = function(y, x) { alpha = .25 x = as.matrix(x) # x = apply(x, 2, scale) x = cbind(1, x) m = nrow(x) n = ncol(x) b = matrix(rnorm(n)) # b = matrix(rep(0, n)) h = 1 / (1 + exp(-x %*% b)) J = -(t(y) %*% log(h) + t(1-y) %*% log(1 -h)) derivJ = t(x) %*% (h-y) niter = 0 while(1) { niter = niter + 1 newb = b - alpha * derivJ cat("newb........\n") print(newb) h = 1 / (1 + exp(-x %*% newb)) newJ = -(t(y) %*% log(h) + t(0-y) %*% log(1 -h)) # step halving # if cost function J is increasing # or abs val of b is too big (leading to aberrant log h ) # then halve the learning rate alpha while((J - newJ < 0) || max(abs(newb)) > 5) { alpha = alpha / 2 newb = b - alpha * derivJ h = 1 / (1 + exp(-x %*% newb)) newJ = -(t(y) %*% log(h) + t(1-y) %*% log(1 -h)) } # at convergence, stop iteration if(max(abs(b - newb)) < 0.00001) { break } b = newb J = newJ derivJ = t(x) %*% (h-y) } print(b) } testglm = function() { nr = 5000 nc = 50 x = matrix(rnorm(nr*nc), nr) y = matrix(sample(0:1, nr, repl=T), nr) res = apply(x, 2, function(coli) summary(glm(y~coli, family=binomial))$coef[2, ]) # print(res) } testlogreg = function() { nr = 500 nc = 50 x = matrix(rnorm(nr*nc), nr) y = matrix(sample(0:1, nr, repl=T), nr) res = apply(x, 2, function(coli) logreg(y, coli)) } system.time(testlogreg()) system.time(testglm())

Here is the link to derivation of cost function via maximum likelihood:

A lot of the content here are adapted from Andrew Ng's amazing course here: https://class.coursera.org/ml-005/lecture

## 0 comments:

Post a Comment