Quadratic form with a diagonal matrix in the middle, application in the calculation of Hessian

Quadratic form with a diagonal matrix in the middle, application in the calculation of Hessian

In the simplest form, where $u, \; v$ are column vectors.

\begin{align*} \begin{pmatrix} u^T \end{pmatrix} \begin{pmatrix} w_1 & 0 & 0 \\ 0 & w_2 & 0 \\ 0 & 0 & w_3 \end{pmatrix} \begin{pmatrix} v \end{pmatrix} &= \begin{pmatrix} u^T_1w_1 & u^T_2w_2 & u^T_3w_3 \end{pmatrix} \begin{pmatrix} v \end{pmatrix} \\ & = \begin{pmatrix} u^T_1w_1v_1 & u^T_2w_2v_2 & u^T_3w_3v_3 \end{pmatrix} \end{align*}

This can be generalized to $U, \; V$ (matrices):

\begin{align*} C &= U \begin{pmatrix} w_1 & 0 & 0 \\ 0 & w_2 & 0 \\ 0 & 0 & w_3 \end{pmatrix} V \\ &= \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ U_{21} & U_{22} & U_{23} \\ \end{pmatrix} \begin{pmatrix} w_1 & 0 & 0 \\ 0 & w_2 & 0 \\ 0 & 0 & w_3 \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \\ V_{31} & V_{32} \\ \end{pmatrix} \\ &= \begin{pmatrix} U_{11}w_1 & U_{12}w_2 & U_{13}w_3 \\ U_{21}w_1 & U_{22}w_2 & U_{23}w_3 \\ \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \\ V_{31} & V_{32} \\ \end{pmatrix} \\ & = \begin{pmatrix} \sum_i U_{1i}w_iV_{i1} & \sum_i U_{1i}w_iV_{i2} \\ \sum_i U_{2i}w_iV_{i1} & \sum_i U_{2i}w_iV_{i2} \\ \end{pmatrix} \end{align*}

Put it more concisely:

\begin{align*} C_{jk} &= \sum_i U_{ji}w_iV_{ik} \end{align*}

This comes up in the calculation of Hessian for the cost (likelihood) function in logistic regression (with batch gradient descent in this post). Remember the gradient is:

\begin{align*} \frac{\partial J}{\partial \beta_j} &= \sum_{i=0}^m (H_{i}-y_{i}) X_{ij} \\ &= \sum_{i=0}^m (\frac{1}{1 + e^{-X_i\beta}}-y_{i}) X_{ij} \\ \end{align*}

In order to get the Hessian, we need to take the second derivative:

\begin{align*} \frac{\partial^2 J}{\partial \beta_k \partial \beta_j} &= \sum_{i=0}^m \frac{\partial}{\partial \beta_k}(\frac{1}{1 + e^{-X_i\beta}}-y_{i}) X_{ij} \\ &= \sum_{i=0}^m -\frac{1}{(1 + e^{-X_i\beta})^2}e^{-X_i\beta}(-X_{ik})X_{ij} \\ &= \sum_{i=0}^m \frac{1}{(1 + e^{-X_i\beta})^2}e^{-X_i\beta}X_{ik}X_{ij} \\ \end{align*}

Now let $w_i = \frac{1}{(1 + e^{-X_i\beta})^2}e^{-X_i\beta}$, we have

\begin{align*} \frac{\partial^2 J}{\partial \beta_k \partial \beta_j} &= \sum_{i=0}^m \frac{1}{(1 + e^{-X_i\beta})^2}e^{-X_i\beta}X_{ik}X_{ij} \\ &= \sum_{i=0}^m w_iX_{ik}X_{ij} \\ \end{align*}

Therefore

\begin{align*} H_J &= X^TWX \end{align*}

where $W = \text{diag }(w_1, w_2, \cdots w_m)$ (using a sparse format for this matrix will save a lot of computation!)

From here you can get the standard errors and z-scores of coefficients this way:

\begin{align*} \sigma &= \sqrt{\text{diag}(H_J^{-1})} \\ z &= ( \beta - 0 ) / \sigma \end{align*} This is called the Wald test.