Wednesday, April 16, 2014

Feasibility of machine learning: Hoeffding's inequality and Vapnik-Chervonenkis dimension

Here $\nu$ is the probability of your hypothesis function agrees with the actual observation within your sample, and $\mu$ is the same probability within the whole population.

Hoeffding's inequality tells us that if sample size $N$ is large enough and our tolerance $\epsilon$ is not too stringent, then the probability of our model getting it right within the sample is very likely to be close to the probability of our model getting it right in the whole population, thus enabling us to evaluation the usefulness of our model.

When we have multiple hypothesis, say $M$, then $P[|\nu - \mu| > \epsilon]$ will get larger, if we suppose all these $M$ hypotheses are disjoint, then

\begin{align} P[|\nu - \mu| > \epsilon] \le 2Me^{-2\epsilon^2 N} \end{align}

This presents a problem for us: if $M$ is infinity, then our model will probably never get close to reality. Fortunately, given a finite data set, $M$ couldn't be infinity. For example, in a binary classification situation, when the dependent variable $y$ can only take 0 or 1, then $M \le 2^N$.

Now let's look at 2 examples. In each case $M < 2^N$, because some hypotheses couldn't be a result of linear separation.

Here comes the idea of a break point, which means when the number of data point is larger than a certain threshold, $M$ will definitely be less than the upper bound $2^N$.

These can be calculated easily, for example $$N+1 < 2^N$$ $$(1/2)N^2 + (1/2)N + 1 < 2^N$$

$2^N$ is already a huge step down from infinity, but we have even better news. It turns out once there is a break point , the magnitude of $M$ will drop further from $2^N$ to polynomial:

As an illustration, suppose we have 3 data points in total and the break point is 2, then for any 2 out of the 3 points , we cannot form all possible $2^2$ hypotheses, if you enumerate all $2^3$ hypotheses, you will find only 4 of them are possible due to this restriction:

Here is a proof for $M$ will drop further from $2^N$ to polynomial. Define the number of possible hypotheses (called the growth function) as a function of the number of data points N and the breakpoint k i.e. $B(N, k)$.

Enumerate all hypotheses in the following table. (Notice that each $X_i$ is a data point, not a scalar.) We single out the last data point for the convenience of recursion in the next steps, and call $h(X_N)$ the extension, and the vector $(h(X_1), h(X_2), \cdots, h(X_{N-1}))$, we call the root. Each unique root has either 1 extension or 2 extensions, for example in this example, (0, 0, 1) has two extensions, while (1, 0, 1) has only one. \begin{align*} \begin{array}{ccc|c} 0&0&1&0 \\ 0&0&1&1 \\ 1&0&1&1 \\ \vdots \end{array} \end{align*} By this we gather all roots that have only one extension into one group $S_1$. And the rest we gather into $S_2$. $S_2$ can be further divided into 2 subgroups according the their extensions. Suppose $S_1$ has $\alpha$ rows and $S_2$ has $2\beta$ rows, then by definition \begin{align*} B(N, k) = \alpha + 2\beta \end{align*}

Now let's pay some attention to the root block of $S_1$, we observe that $\alpha + \beta \le B(N-1, k)$. Similarly we have $\beta \le B(N-1, k-1)$ (think about it, why is it $k-1$ here?) thus, \begin{align*} B(N, k) &= \alpha + 2\beta \\ &\le B(N-1, k) + B(N-1, k-1) \end{align*}

We have got an upper bound here and this can be done recursively (like a Pascal's triangle):


We want to keep doing this until we get the simplest form possible. What is the simplest form then? Observe: \begin{align*} B(N, \;\;1) &= 1 \\ &= B(N-1,\;\;1) + B(N-1, \;\;0) \\ &= B(N-1,\;\;1) + B(N-2, \;\;0) + B(N-2, \;\;-1) \\ &= \cdots \end{align*} From here we know $B(N, x) = 0$ for all non-positive $x$. Further: \begin{align*} B(1, \;\;1) &= 1 \\ &= B(0, \;\;1) + B(0, \;\;0) \\ &= B(0, \;\;1) + 0 \\ \end{align*} From here we know $B(0, k) = 1$ for all positive k. Now we see we can reduce $B(N, k)$ all the way down to \begin{align*} &{N \choose 0}B(0,\;k) \\ &{N \choose 1}B(0,\;k-1) \\ &{N \choose 2}B(0,\;k-2) \\ & \vdots \\ &{N \choose k-1}B(0,\;k-N) \\ \end{align*} $N > k$, so we have a lot of zeros there, eliminate them: \begin{align*} &{N \choose 0}B(0,\;k) \\ &{N \choose 1}B(0,\;k-1) \\ &{N \choose 2}B(0,\;k-2) \\ & \vdots \\ &{N \choose k-1}B(0,\;1) \\ \end{align*} In the above list, $B(...)$ are all equal to one. Sum them up and we have this important result: \begin{align*} B(N, \; k) \le \sum_{i=0}^{k-1}{N \choose i} \end{align*} which is a polynomial whose highest-degree term is $N^{k-1}$.

Recall Hoeffding's inequality: \begin{align} P[|\nu - \mu| > \epsilon] \le 2Me^{-2\epsilon^2 N} \end{align} We have reduced $M$ from infinity to polynomial: \begin{align} P[|\nu - \mu| > \epsilon] \le 2B(N, k)e^{-2\epsilon^2 N} \end{align} This looks good, any increase gained by the polynomial will likely be killed by the negative exponential, but we are a little too optimistic, it's actually: \begin{align} P[|\nu - \mu| > \epsilon] \le 4B(2N, k)e^{-(1/8)\epsilon^2 N} \end{align} This is called the Vapnik-Chervonenkis inequality, and k-1 is called the Vapnik-Chervonenkis dimension (VC dimension).