%% Cell type:markdown id: tags:
# ITMAL Exercise
## Training a Linear Regressor I
The goal of linear regression is to find the weight vector $\mathbf{w}$ that minimizes the sum-of-squares error over all inputs.
Given the usual ML input data matrix $\mathbf X$ of size $(n,d)$, where each row is the transpose of an input column vector $(\mathbf{x}^{(i)})^\top$, i.e. a data sample of size $d$
$$
\newcommand\rem[1]{}
\rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs}
\newcommand\eq[2]{#1 &=& #2\\}
\newcommand\ar[2]{\begin{array}{#1}#2\end{array}}
\newcommand\ac[2]{\left[\ar{#1}{#2}\right]}
\newcommand\st[1]{_{\mbox{\scriptsize #1}}}
\newcommand\norm[1]{{\cal L}_{#1}}
\newcommand\obs[2]{#1_{\mbox{\scriptsize obs}}^{\left(#2\right)}}
\newcommand\diff[1]{\mbox{d}#1}
\newcommand\pown[1]{^{(#1)}}
\def\pownn{\pown{n}}
\def\powni{\pown{i}}
\def\powtest{\pown{\mbox{\scriptsize test}}}
\def\powtrain{\pown{\mbox{\scriptsize train}}}
\def\bX{\mathbf{X}}
\def\bZ{\mathbf{Z}}
\def\bx{\mathbf{x}}
\def\by{\mathbf{y}}
\def\bz{\mathbf{z}}
\def\bw{\mathbf{w}}
\def\btheta{{\boldsymbol\theta}}
\def\bSigma{{\boldsymbol\Sigma}}
\def\half{\frac{1}{2}}
\newcommand\pfrac[2]{\frac{\partial~#1}{\partial~#2}}
\newcommand\dfrac[2]{\frac{\mbox{d}~#1}{\mbox{d}#2}}
\bX =
\ac{cccc}{
x_1\pown{1} & x_2\pown{1} & \cdots & x_d\pown{1} \\
x_1\pown{2} & x_2\pown{2} & \cdots & x_d\pown{2}\\
\vdots & & & \vdots \\
x_1\pownn & x_2\pownn & \cdots & x_d\pownn\\
}
$$
and $\by$ is the target output column vector of size $n$
$$
\by =
\ac{c}{
y\pown{1} \\
y\pown{2} \\
\vdots \\
y\pown{n} \\
}
$$
The linear regression model, via its hypothesis function and for an input column vector $\bx\powni$ of size $d$ and a column weight vector $\bw$ of size $d+1$ (with the additional element $w_0$ being the bias), can now be written simply as
$$
\ar{rl}{
h(\bx\powni;\bw) &= \bw^\top \ac{c}{1\\\bx\powni} \\
&= w_0 + w_1 x_1\powni + w_2 x_2\powni + \cdots + w_d x_d\powni
}
$$
using the model parameters or weights, $\bw$, also known as $\btheta$. To ease notation, $\bx$ is assumed to have the 1-element prepended in the following, so that $\bx$ becomes a column vector of size $d+1$
$$
\ar{rl}{
\ac{c}{1\\\bx\powni} &\mapsto \bx\powni, ~~~~\mbox{by convention in the following...}\\
h(\bx\powni;\bw) &= \bw^\top \bx\powni
}
$$
This is actually the first fully white-box machine learning algorithm that we see. All the gory details of the algorithm are clearly visible in the inner vector product...quite simple, right? Now we just need to train the weights...
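Just to make the inner product concrete, here is a minimal NumPy sketch (an added illustration, not part of the exercise) of the hypothesis function with the bias element prepended; the weights and input below are purely made-up numbers:
``` python
import numpy as np

def hypothesis(x, w):
    """Return h(x; w) = w^T [1, x] for a single input vector x of size d."""
    x_b = np.concatenate(([1.0], x))   # prepend the bias element 1
    return w @ x_b                     # inner product w^T x_b

# made-up example: d=2 features, w = [w0 (bias), w1, w2]
w = np.array([0.5, 2.0, -1.0])
x = np.array([3.0, 4.0])
print(hypothesis(x, w))                # 0.5 + 2*3 - 1*4 = 2.5
```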
### Loss or Objective Function - Formulation for Linear Regression
The individual cost (or loss), $L\powni$, for a single input vector $\bx\powni$ is a measure of how well the model is able to fit the data: the higher the $L\powni$ value, the worse the fit. A loss of $L=0$ means a perfect fit.
It can be given by, say, the squared difference between the calculated output, $h$, and the desired output, $y$
$$
\ar{rl}{
L\powni &= \left( h(\bx\powni;\bw) - y\powni \right)^2\\
&= \left( \bw^\top\bx\powni - y\powni \right)^2
}
$$
Minimizing all the $L\powni$ losses (and thereby also the MSE or RMSE) amounts to minimizing the sum of all the
individual costs, here expressed via the mean squared error
$$
\ar{rl}{
\mbox{MSE}(\bX,\by;\bw) &= \frac{1}{n} \sum_{i=1}^{n} L\powni \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \bw^\top\bx\powni - y\powni \right)^2\\
&= \frac{1}{n} ||\bX \bw - \by||_2^2
}
$$
here using the squared Euclidean norm, $\norm{2}^2$, written via the $||\cdot||_2^2$ expression.
The factor $\frac{1}{n}$ is just a constant and can be ignored when minimizing, yielding the total cost function
$$
\ar{rl}{
J &= \frac{1}{2} ||\bX \bw - \by||_2^2\\
&\propto \mbox{MSE}
}
$$
adding yet another constant, 1/2, to ease later differentiation of $J$.
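As a small numerical sanity check (an added sketch, not part of the original text), the sum-of-losses form and the $||\cdot||_2^2$-norm form of the MSE can be compared directly, here on made-up toy data where $\bX$ already includes the bias column:
``` python
import numpy as np

# made-up toy data: n=3 samples, d=1 feature plus a prepended bias column
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])
y = np.array([1.0, 3.0, 6.0])
w = np.array([0.2, 2.0])

# MSE as the mean of the individual squared losses L^(i)
mse_sum = np.mean([(w @ X[i] - y[i])**2 for i in range(len(y))])

# MSE via the squared Euclidean norm ||Xw - y||_2^2
mse_norm = np.linalg.norm(X @ w - y)**2 / len(y)

print(mse_sum, mse_norm)   # the two formulations agree
```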
### Training
Training the linear regression model now amounts to computing the optimal value of the weight vector $\bw$; that is, finding the $\bw$-value that minimizes the total cost
$$
\bw^* = \mbox{argmin}_\bw~J\\
$$
where $\mbox{argmin}_\bw$ means: find the argument $\bw$ that minimizes the function $J$. This minimizing argument (sometimes a maximizing one, via argmax) is denoted $\bw^*$ in most ML literature.
In 2-D the minimization can be visualized as finding the lowest point on the $J$ curve, which for linear regression always forms a convex shape
<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization.png" alt="WARNING: image could not be fetched" style="height:240px">
### Training: The Closed-form Solution
To solve for $\bw^*$ in closed form (i.e. directly, without any numerical approximation), we find the gradient of $J$ with respect to $\bw$. Taking the partial derivative $\partial/\partial\bw$ of $J$ via the gradient (nabla) operator
$$
\rem{
\frac{\partial}{\partial \bw} =
\ac{c}{
\frac{\partial}{\partial w_1} \\
\frac{\partial}{\partial w_2} \\
\vdots\\
\frac{\partial}{\partial w_d}
}
}
\nabla_\bw~J =
\left[ \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots , \frac{\partial J}{\partial w_d} \right]^\top
$$
Setting the gradient to zero, and ignoring the constant factors $\frac{1}{2}$
and $\frac{1}{n}$, yields the optimal solution for $\bw$
$$
\ar{rl}{
\nabla_\bw J(\bw) &= \bX^\top \left( \bX \bw - \by \right) ~=~ 0\\
0 &= \bX^\top\bX \bw - \bX^\top\by
}
$$
giving the closed-form solution, with $\by = [y\pown{1}, y\pown{2}, \cdots,
y\pown{n}]^\top$
$$
\bw^* ~=~ \left( \bX^\top \bX \right)^{-1} \bX^\top \by
$$
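As an illustration (an added sketch on made-up toy data, not the ITMAL test data used in Qa below), the normal equation can be evaluated directly in NumPy; remember the all-ones bias column:
``` python
import numpy as np

# made-up toy data: n=20 samples of a single feature, roughly y = 4 + 3x + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((20, 1))
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(20)

# prepend the all-ones bias column so that w[0] becomes the intercept w_0
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# closed-form solution (normal equation): w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(w_star)                                   # roughly [4, 3] for this toy data

# a numerically more stable alternative to the explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
```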
You already know this method from calculus: finding the extrema of a function, say
$$f(w)=w^2-2w-2$$
is done by finding the point where the derivative $\mbox{d}~f(w)/\mbox{d}w = 0$
$$
\dfrac{f(w)}{w} = 2w -2 = 0
$$
so we see that there is an extremum at $w=1$. Checking the second derivative tells us whether we have a minimum, a maximum or a saddle point there; here $\mbox{d}^2 f(w)/\mbox{d}w^2 = 2 > 0$, so $w=1$ is a minimum. In matrix terms, this corresponds to examining the _Hessian_ matrix, which gets notationally tricky due to the multiple feature dimensions involved.
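As an added note: for the quadratic cost $J$ above the Hessian can actually be written down directly
$$
\nabla^2_\bw~J = \bX^\top \bX
$$
which is positive semi-definite, so $J$ is convex and the closed-form $\bw^*$ is a global minimum.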
> https://en.wikipedia.org/wiki/Ordinary_least_squares
> https://en.wikipedia.org/wiki/Hessian_matrix
#### Qa Write a Python function that uses the closed-form to find $\bw^*$
Use the test data, `X1` and `y1`, in the code below to find `w1` via the closed-form solution. Use the expected test vector, `w1_expected`, to test your implementation, and remember to add the bias term (concatenate an all-ones column to `X1` before solving).
%% Cell type:code id: tags:
``` python
# TODO: Qa...
# TEST DATA:
import numpy as np
from libitmal import utils as itmalutils
itmalutils.ResetRandom()
X1 = np.array([[8.34044009e-01],[1.44064899e+00],[2.28749635e-04],[6.04665145e-01]])
y1 = np.array([5.97396028, 7.24897834, 4.86609388, 3.51245674])
w1_expected = np.array([4.046879011698, 1.880121487278])
itmalutils.AssertInRange(w1_expected, w1_expected,eps=1E-9)
assert False, "find the least-square solution for X1 and y1, your implementation here, say from [HOML] p.114"
# w1 = ...
# TEST VECTOR:
itmalutils.PrintMatrix(w1, label="w1=", precision=12)
itmalutils.AssertInRange(w1,w1_expected,eps=1E-9)
```
%% Cell type:markdown id: tags:
#### Qb Find the limits of the least-square method
Again find the least-square optimal value, `w2`, now using `X2` and `y2` as inputs.
Describe the problem with the matrix inverse: for what `M` and `N` combinations do you see that the calculation of the matrix inverse takes a long time?
%% Cell type:code id: tags:
``` python
# TODO: Qb...
# TEST DATA: Matrix, taken from [HOML], p108
M=100
N=1
print(f'More test data, M={M}, N={N}...')
X2=2 * np.random.rand(M,N)
y2=4 + 3*X2 + np.random.randn(M,1)
y2=y2[:,0] # well, could do better here!
assert False, "find the least-square solution for X2 and y2, again"
# w2 =
```
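%% Cell type:markdown id: tags:
To get a feel for where the matrix inverse becomes expensive, a rough timing sketch along the lines below can be used (an added illustration with made-up `M` and `N` values, not the answer to Qb itself; the exact timings depend on your machine).
%% Cell type:code id: tags:
``` python
import time
import numpy as np

# Rough timing sketch: the normal equation inverts the NxN matrix X^T X,
# so the cost grows roughly as O(N^3) in the number of features N.
M = 2000                        # number of samples (made-up value)
for N in (10, 100, 500, 1000):  # number of features (made-up values)
    X = np.random.rand(M, N)
    A = X.T @ X                 # the NxN matrix to invert
    t0 = time.perf_counter()
    np.linalg.inv(A)
    dt = time.perf_counter() - t0
    print(f'M={M}, N={N:4d}: inverting the {N}x{N} matrix took {dt:.4f} s')
```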
%% Cell type:markdown id: tags:
REVISIONS| |
---------| |
2018-1218| CEF, initial.
2018-0214| CEF, major update.
2018-0218| CEF, fixed error in nabla expression.
2018-0218| CEF, added minimization plot.
2018-0218| CEF, added note on argmin/max.
2018-0218| CEF, changed concave to convex.
2021-0926| CEF, update for ITMAL E21.
2021-1002| CEF, corrected page numbers for HOML v2 (109=>114).