Monday, April 7, 2014

Logistic regression with gradient descent algorithm

First, the hypothesis function, which models the relationship between the dependent variable $P(y = 1)$ and the independent variables $x$: \begin{align*} H_i = h(x_i) &= \frac{1}{1 + e^{-x_i \cdot \beta}} \end{align*} where $x_i$ is the $i$th row of the design matrix $\underset{m \times n}{X}$ and $\beta$ is the $n \times 1$ coefficient vector. In matrix form: \begin{align*} H &= \frac{1}{1 + e^{-X \beta}} \end{align*} $H$ is an $m \times 1$ matrix. Except for $X\beta$, all operations are element-wise.
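As a quick illustration, this is one line in R (a minimal sketch with made-up data; the names sigmoid, X, and beta are mine):

sigmoid = function(z) 1 / (1 + exp(-z))   # element-wise logistic function
X = cbind(1, matrix(rnorm(6), 3))         # toy 3 x 3 design matrix with intercept column
beta = matrix(rnorm(3))                   # toy 3 x 1 coefficient vector
H = sigmoid(X %*% beta)                   # 3 x 1 matrix of modeled P(y = 1)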
The cost function $J$ is a measure of the deviance of the modeled dependent variable from the observed $y$. (Notice how the cost varies with $h$ when $y = 0$ and $y = 1$, respectively.)
The figure can be plotted with Sage:
h = var("h")
p1 = plot(-log(h), (h, 0, 1),
          axes_labels=['$h_i$', '$J_i$'],
          legend_label="$-\log h$",
          color="blue")
p2 = plot(-log(1 - h), (h, 0, 1),
          axes_labels=['$h_i$', '$J_i$'],
          legend_label="$-\log (1-h)$",
          color="red")
show(p1 + p2, figsize=[4, 4])
\begin{align*} J_i &= \begin{cases} -\log H_i, & y_i = 1 \\ -\log (1-H_i), & y_i = 0 \end{cases} \\ &= -y_i\log H_i - (1-y_i)\log (1-H_i) \end{align*} Here $J_i$ is the $i$th addend of $J$ and $H_i$ is the $i$th element of $H$. \begin{align*} J &= \frac{1}{m} \sum_{i = 1}^m J_i \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ -y_i\log H_i - (1-y_i)\log (1-H_i) \right] \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ -y_i \log \frac{1}{1+e^{- x_i \cdot \beta}} - (1 - y_i) \log \left( 1 - \frac{1}{1+e^{- x_i \cdot \beta}} \right) \right] \\ &= \frac{1}{m}\sum_{i = 1}^m \left[ y_i \log (1+e^{-x_i \cdot \beta}) + (1 - y_i) \log \left( 1+e^{x_i \cdot \beta} \right) \right] \end{align*}

The goal is to find the value of $\beta$ that minimizes $J$, so we need its derivative with respect to each $\beta_j$. Differentiating, and using the facts that $\frac{e^{x_i \cdot \beta}}{1+e^{x_i \cdot \beta}} = \frac{1}{1+e^{-x_i \cdot \beta}} = H_i$ and $1 + e^{-x_i \cdot \beta} = 1/H_i$:

\begin{align*} \frac{\partial J}{\partial \beta_j} &= \frac{1}{m} \sum_{i=1}^m \left[ y_i H_i e^{-x_i \cdot \beta} \left( -X_{ij} \right) + \left( 1-y_i \right) \frac{1}{1+e^{x_i \cdot \beta}} e^{x_i \cdot \beta} X_{ij} \right] \\ &= \frac{1}{m} \sum_{i=1}^m \left[ y_i H_i e^{-x_i \cdot \beta} \left( -X_{ij} \right) + \left( 1-y_i \right) H_i X_{ij} \right] \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( -y_i e^{-x_i \cdot \beta} + 1 - y_i \right) \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( 1 - y_i \left( 1+e^{-x_i \cdot \beta} \right) \right) \\ &= \frac{1}{m} \sum_{i=1}^m H_i X_{ij} \left( 1 - y_i / H_i \right) \\ &= \frac{1}{m} \sum_{i=1}^m \left( H_i - y_i \right) X_{ij} \\ &= \frac{1}{m} (H - y) \cdot X_j \end{align*}

where $X_j$ is the $j$th column of $X$. Stacking the $n$ partial derivatives gives the gradient, an $n \times 1$ matrix: \begin{align*} \frac{\partial J}{\partial \beta} &= \frac{1}{m} X^T (H-y) \end{align*}

This all looks very complicated, but $J$ is just a function of $\beta$, and as in a typical optimization problem, we need to find the $\beta$ that makes $\frac{\partial J}{\partial \beta}$ equal to zero. To get there we use the gradient descent algorithm, which can be summarized as follows:

Until convergence: \[ \beta := \beta - \alpha \frac{\partial J}{\partial \beta} \] This looks strange at first glance, but it actually makes sense. $\alpha$ here is called the learning rate. As shown in the figure below, we want to reach the point where the slope is zero.
When the slope is large, we are far from the minimum, so we can move in bigger steps: $\alpha \frac{\partial J}{\partial \beta}$ is big. As we approach the goal, the slope nears zero, so we must be careful and take smaller steps: $\alpha \frac{\partial J}{\partial \beta}$ is small. This is all taken care of automatically by $\frac{\partial J}{\partial \beta}$, and as long as $\alpha$ is not too big, $\beta$ will converge (we can use step halving to guard against an over-sized $\alpha$). The result of this iterative assignment is the optimal $\beta$.


[Figure: the cost curve, with gradient descent steps shrinking as the slope approaches zero]
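To make the shrinking steps concrete, here is a toy run of the same update rule on the one-variable cost $J(\beta) = (\beta - 3)^2$, whose slope is $2(\beta - 3)$ (my own example, separate from the logistic model):

alpha = 0.25
b = 0
for (i in 1:10) {
    step = alpha * 2 * (b - 3)   # alpha times the slope at the current b
    b = b - step
    cat(sprintf("iter %2d: b = %.5f, step = %+.5f\n", i, b, -step))
}
# each step is half the previous one: the slope shrinks as b approaches
# the minimum at 3, so the updates slow down automatically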

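Before implementing the full algorithm, the gradient formula $\frac{\partial J}{\partial \beta} = \frac{1}{m} X^T (H - y)$ can be sanity-checked against finite differences of $J$ (a sketch with toy data; all names and sizes here are arbitrary):

# verify dJ/dbeta = t(X) %*% (H - y) / m by central finite differences
set.seed(1)
m = 100; n = 3
X = cbind(1, matrix(rnorm(m * (n - 1)), m))   # toy design matrix with intercept
y = matrix(sample(0:1, m, replace = TRUE))    # toy 0/1 outcomes
beta = matrix(rnorm(n))
J = function(b) {
    h = 1 / (1 + exp(-X %*% b))
    -sum(y * log(h) + (1 - y) * log(1 - h)) / m
}
analytic = t(X) %*% (1 / (1 + exp(-X %*% beta)) - y) / m
numeric = sapply(1:n, function(j) {
    e = matrix(0, n); e[j] = 1e-6
    (J(beta + e) - J(beta - e)) / 2e-6
})
max(abs(analytic - numeric))   # largest discrepancy; should be ~1e-8 or smaller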
Finally, an implementation of this algorithm in R:
Batch gradient descent in R
logreg = function(y, x) {
    alpha = .25
    x = as.matrix(x)
    # x = apply(x, 2, scale)
    x = cbind(1, x)                # prepend the intercept column
    m = nrow(x)
    n = ncol(x)

    b = matrix(rnorm(n))           # random starting coefficients
    # b = matrix(rep(0, n))
    h = 1 / (1 + exp(-x %*% b))    # hypothesis H

    # cost and gradient; the constant 1/m factor is dropped here,
    # since it can be absorbed into the learning rate alpha
    J = -(t(y) %*% log(h) + t(1 - y) %*% log(1 - h))
    derivJ = t(x) %*% (h - y)

    niter = 0
    while (1) {
        niter = niter + 1
        newb = b - alpha * derivJ
        # cat("newb........\n"); print(newb)   # uncomment to trace iterations
        h = 1 / (1 + exp(-x %*% newb))
        newJ = -(t(y) %*% log(h) + t(1 - y) %*% log(1 - h))

        # step halving:
        # if cost function J is increasing,
        # or abs val of b is too big (leading to aberrant log h),
        # then halve the learning rate alpha
        while ((J - newJ < 0) || max(abs(newb)) > 5) {
            alpha = alpha / 2
            newb = b - alpha * derivJ
            h = 1 / (1 + exp(-x %*% newb))
            newJ = -(t(y) %*% log(h) + t(1 - y) %*% log(1 - h))
        }
        # at convergence, stop iterating
        if (max(abs(b - newb)) < 0.00001) {
            break
        }
        b = newb
        J = newJ
        derivJ = t(x) %*% (h - y)
    }
    print(b)
}

testglm = function() {
    nr = 5000
    nc = 50
    x = matrix(rnorm(nr * nc), nr)
    y = matrix(sample(0:1, nr, replace = TRUE))
    res = apply(x, 2, function(coli) summary(glm(y ~ coli, family = binomial))$coef[2, ])
    # print(res)
}

testlogreg = function() {
    nr = 500
    nc = 50
    x = matrix(rnorm(nr * nc), nr)
    y = matrix(sample(0:1, nr, replace = TRUE))
    res = apply(x, 2, function(coli) logreg(y, coli))
}
system.time(testlogreg())
system.time(testglm())
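As a correctness check (my own snippet, not part of the original timing test), the coefficients from logreg can be compared with R's built-in glm; since print(b) returns its argument invisibly, logreg already returns the fitted coefficients:

set.seed(2)
x = matrix(rnorm(200))
y = matrix(sample(0:1, 200, replace = TRUE))
b.gd = logreg(y, x)                            # intercept and slope from gradient descent
b.glm = coef(glm(y ~ x, family = binomial))    # same model fit by glm
print(cbind(gradient.descent = as.vector(b.gd), glm = as.vector(b.glm)))
# the two columns should agree to several decimal places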

The cost function above can also be derived via maximum likelihood.
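A sketch of that derivation: each $y_i$ is a Bernoulli trial with success probability $H_i$, so the likelihood of the observed data is the product below, and minimizing the negative log-likelihood (scaled by $1/m$, which does not change the minimizer) gives exactly the cost $J$ above: \begin{align*} L(\beta) &= \prod_{i=1}^m H_i^{y_i} (1 - H_i)^{1 - y_i} \\ \log L(\beta) &= \sum_{i=1}^m \left[ y_i \log H_i + (1 - y_i) \log (1 - H_i) \right] \\ J &= -\frac{1}{m} \log L(\beta) \end{align*}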
A lot of the content here is adapted from Andrew Ng's amazing course: https://class.coursera.org/ml-005/lecture
