Commit ed0a8c03 authored by Carsten Eie Frigaard's avatar Carsten Eie Frigaard
%% Cell type:markdown id: tags:
# ITMAL Exercise
## Training a Linear Regressor I
The goal of the linear regression is to find the argument $w$ that minimizes the sum-of-squares error over all inputs.
Given the usual ML input data matrix $\mathbf{X}$ of size $(n,d)$, where each row is a transposed input sample $(\mathbf{x}^{(i)})^\top$ of size $d$
$$
\mathbf{X} =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\
\vdots    &           &        & \vdots    \\
x_1^{(n)} & x_2^{(n)} & \cdots & x_d^{(n)}
\end{bmatrix}
$$
and $\mathbf{y}$ is the target output column vector of size $n$
$$
\mathbf{y} =
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(n)}
\end{bmatrix}
$$
The linear regression model, via its hypothesis function and for a column vector input $\mathbf{x}^{(i)}$ of size $d$ and a column weight vector $\mathbf{w}$ of size $d+1$ (with the additional element $w_0$ being the bias), can now be written simply as
$$
\begin{aligned}
h(\mathbf{x}^{(i)};\mathbf{w}) &= \mathbf{w}^\top \begin{bmatrix} 1 \\ \mathbf{x}^{(i)} \end{bmatrix} \\
 &= w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_d x_d^{(i)}
\end{aligned}
$$
using the model parameters or weights, $\mathbf{w}$, also known as $\boldsymbol{\theta}$. To ease notation, $\mathbf{x}$ is assumed to have the 1-element prepended in the following, so that $\mathbf{x}$ is a $d+1$ column vector
$$
\begin{aligned}
\begin{bmatrix} 1 \\ \mathbf{x}^{(i)} \end{bmatrix} &\mapsto \mathbf{x}^{(i)}, \quad \text{by convention in the following,} \\
h(\mathbf{x}^{(i)};\mathbf{w}) &= \mathbf{w}^\top \mathbf{x}^{(i)}
\end{aligned}
$$
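With the 1-element prepended, the hypothesis is just a single inner product. A minimal NumPy sketch; the function and variable names are illustrative only, not from the course code:

``` python
import numpy as np

def h(x, w):
    """Linear hypothesis h(x; w) = w^T [1, x], with the bias 1 prepended."""
    x_b = np.concatenate(([1.0], x))  # size d+1, first element is the bias input
    return w @ x_b                    # inner product w^T x_b

# a d=2 sample and a d+1=3 weight vector (w[0] is the bias w_0)
x = np.array([2.0, 3.0])
w = np.array([0.5, 1.0, -1.0])
print(h(x, w))  # 0.5 + 1.0*2.0 + (-1.0)*3.0 = -0.5
```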
This is actually the first fully white-box machine learning algorithm that we see. All the gory details of the algorithm are clearly visible in the inner vector multiplication...quite simple, right? Now we just need to train the weights...
### Loss or Objective Function - Formulation for Linear Regression
The individual cost (or loss), $L^{(i)}$, for a single input vector $\mathbf{x}^{(i)}$ is a measure of how well the model is able to fit the data: the higher the $L^{(i)}$ value, the worse the fit. A loss of $L=0$ means a perfect fit.
It can be given by, say, the squared difference between the calculated output, $h$, and the desired output, $y$
$$
\begin{aligned}
L^{(i)} &= \left( h(\mathbf{x}^{(i)};\mathbf{w}) - y^{(i)} \right)^2 \\
 &= \left( \mathbf{w}^\top\mathbf{x}^{(i)} - y^{(i)} \right)^2
\end{aligned}
$$
To minimize all the $L^{(i)}$ losses (or, indirectly, also the MSE or RMSE) is to minimize the sum of all the individual costs, via the total cost function $J$
$$
\begin{aligned}
\mathrm{MSE}(\mathbf{X},\mathbf{y};\mathbf{w}) &= \frac{1}{n} \sum_{i=1}^{n} L^{(i)} \\
 &= \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{w}^\top\mathbf{x}^{(i)} - y^{(i)} \right)^2 \\
 &= \frac{1}{n} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2
\end{aligned}
$$
here using the squared Euclidean norm, $\mathcal{L}_2^2$, via the $\lVert\cdot\rVert_2^2$ expression.
Now, the factor $\frac{1}{n}$ is just a constant and can be ignored, yielding the total cost function

$$
\begin{aligned}
J &= \frac{1}{2} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2 \\
 &\propto \mathrm{MSE}
\end{aligned}
$$

with yet another constant, $\frac{1}{2}$, added to ease the later differentiation of $J$.
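The equivalent MSE forms above can be checked against each other numerically. A small sketch on random data, assuming $\mathbf{X}$ already carries the prepended all-ones bias column:

``` python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias column prepended
y = rng.normal(size=n)
w = rng.normal(size=d + 1)

# sum over the individual losses L^(i) ...
mse_loop = sum((w @ X[i] - y[i])**2 for i in range(n)) / n
# ... equals the vectorized squared L2-norm form
mse_vec = np.linalg.norm(X @ w - y)**2 / n
# and J drops the 1/n, adding the constant 1/2
J = 0.5 * np.linalg.norm(X @ w - y)**2

print(mse_loop, mse_vec, J)
```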
### Training
Training the linear regression model now amounts to computing the optimal value of the weight vector $\mathbf{w}$; that is, finding the $\mathbf{w}$-value that minimizes the total cost
$$
\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}}\ J
$$

where $\operatorname{argmin}_{\mathbf{w}}$ means: find the argument $\mathbf{w}$ that minimizes the function $J$. This minimizer (sometimes a maximizer, via argmax) is denoted $\mathbf{w}^*$ in most ML literature.
The minimization can, in 2-D, be visualized as finding the lowest point on the $J$ curve, which for linear regression always forms a convex shape
*(figure: the convex cost curve $J$ with its minimum marked at $\mathbf{w}^*$; image not available)*
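The argmin idea can also be illustrated by brute force: evaluate a convex 1-D cost on a grid and pick the argument giving the lowest value. Purely illustrative; the grid and the cost $(w-1)^2$ are made-up examples:

``` python
import numpy as np

# a convex 1-D cost with its minimum at w = 1
w_grid = np.linspace(-2, 4, 601)
J_grid = (w_grid - 1)**2

w_star = w_grid[np.argmin(J_grid)]  # argmin: the w-value giving the lowest J
print(w_star)  # approximately 1.0
```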
### Training: The Closed-form Solution
To solve for $\mathbf{w}^*$ in closed form (i.e. directly, without any numerical approximation), we find the gradient of $J$ with respect to $\mathbf{w}$. Taking the partial derivative $\partial/\partial\mathbf{w}$ of $J$ via the gradient (nabla) operator
$$
\frac{\partial}{\partial \mathbf{w}} =
\begin{bmatrix}
\frac{\partial}{\partial w_1} \\
\frac{\partial}{\partial w_2} \\
\vdots \\
\frac{\partial}{\partial w_d}
\end{bmatrix},
\qquad
\nabla_{\mathbf{w}}\,J =
\left[ \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \ldots, \frac{\partial J}{\partial w_d} \right]^\top
$$
and setting it to zero yields the optimal solution for $\mathbf{w}$, ignoring the constant factors $\frac{1}{2}$ and $\frac{1}{n}$
$$
\begin{aligned}
\nabla_{\mathbf{w}} J(\mathbf{w}) &= \mathbf{X}^\top \left( \mathbf{X}\mathbf{w} - \mathbf{y} \right) ~=~ 0 \\
0 &= \mathbf{X}^\top\mathbf{X}\mathbf{w} - \mathbf{X}^\top\mathbf{y}
\end{aligned}
$$
giving the closed-form solution, with $\mathbf{y} = [y^{(1)}, y^{(2)}, \cdots, y^{(n)}]^\top$

$$
\mathbf{w}^* ~=~ \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}
$$
You already know this method from math: finding the extrema of a function, say $f(w) = w^2 - 2w + 1$, is done by finding the place where the gradient $\mathrm{d}f(w)/\mathrm{d}w = 0$

$$
\frac{\mathrm{d}f(w)}{\mathrm{d}w} = 2w - 2 = 0
$$
so we see that there is an extremum at $w=1$. Checking the second derivative tells us whether we are seeing a minimum, a maximum or a saddle point at that location. In matrix terms, this corresponds to finding the _Hessian_ matrix, and it gets notationally tricky due to the multiple feature dimensions involved.
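As a sanity check of the closed-form solution, a small sketch on synthetic data; the generating line $y = 4 + 3x$, the noise level and the sizes are arbitrary choices echoing the [HOML] test data:

``` python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0, 2, size=n)
y = 4 + 3 * x + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x])       # prepend the all-ones bias column
w_star = np.linalg.inv(X.T @ X) @ X.T @ y  # closed form: (X^T X)^-1 X^T y

# the gradient X^T (X w - y) should vanish at the optimum
grad = X.T @ (X @ w_star - y)
assert np.allclose(grad, 0, atol=1e-6)

# and the result should agree with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_star, w_lstsq)
print(w_star)  # close to [4, 3]
```

In practice `np.linalg.lstsq` (or the pseudo-inverse) is preferred over an explicit inverse, which is exactly what the questions below probe.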
#### Qa Write a Python function that uses the closed-form to find $\bw^*$
Use the test data, `X1` and `y1`, in the code below to find `w1` via the closed form. Use the test vector `w1_expected` to check your implementation, and remember to add the bias term (concatenate an all-ones vector to `X` before solving).
%% Cell type:code id: tags:
``` python
# TODO: Qa...
import numpy as np
from libitmal import utils as itmalutils

X1 = np.array([[8.34044009e-01],[1.44064899e+00],[2.28749635e-04],[6.04665145e-01]])
y1 = np.array([5.97396028, 7.24897834, 4.86609388, 3.51245674])
w1_expected = np.array([4.046879011698, 1.880121487278])

assert False, "find the least-squares solution for X1 and y1, your implementation here, say from [HOML] p.114"
# w1 = ...

itmalutils.AssertInRange(w1, w1_expected, eps=1E-9)
itmalutils.PrintMatrix(w1, label="w1=", precision=12)
```
%% Cell type:markdown id: tags:
#### Qb Find the limits of the least-square method
Again find the least-squares optimal value for `w2`, now using `X2` and `y2` as inputs.
Describe the problem with the matrix inverse: for what `M` and `N` combinations do you see that the calculation of the matrix inverse takes a long time?
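The expensive part of the normal equation is inverting the $N \times N$ matrix $\mathbf{X}^\top\mathbf{X}$, and matrix inversion scales roughly as $O(N^3)$ in the number of features $N$. A rough timing sketch; the sizes are arbitrary, and the diagonally dominant test matrix is just a stand-in for $\mathbf{X}^\top\mathbf{X}$:

``` python
import time
import numpy as np

rng = np.random.default_rng(1)
for N in (100, 400, 800):
    # a well-conditioned (diagonally dominant) random N x N matrix
    A = rng.normal(size=(N, N)) + N * np.eye(N)
    t0 = time.perf_counter()
    np.linalg.inv(A)
    dt = time.perf_counter() - t0
    print(f"N={N:4d}: inverse took {dt:.4f} s")  # grows roughly cubically in N
```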
%% Cell type:code id: tags:
``` python
# TODO: Qb...
# TEST DATA: matrix taken from [HOML], p. 108
M, N = 100, 1  # try increasing these to probe the limits
print(f'More test data, M={M}, N={N}...')

X2 = 2 * np.random.rand(M, N)
y2 = 4 + 3*X2 + np.random.randn(M, 1)
y2 = y2[:, 0]  # well, could do better here!

assert False, "find the least-squares solution for X2 and y2, again"
# w2 = ...
```
%% Cell type:markdown id: tags:
Date | Change
---------|----------
2018-1218| CEF, initial.
2018-0214| CEF, major update.
2018-0218| CEF, fixed error in nabla expression.
2018-0218| CEF, added minimization plot.
2018-0218| CEF, added note on argmin/max.
2018-0218| CEF, changed concave to convex.
2021-0926| CEF, update for ITMAL E21.
2021-1002| CEF, corrected page numbers for HOML v2 (109=>114).
%% Cell type:markdown id: tags:
# ITMAL Exercise
## Convolutional Neural Networks (CNNs)
Exercise 9 from [HOML], p. 496:
__"9. Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST."__
For the journal:
* write an introduction to CNNs (what are CNNs, what is a convolution layer, etc.),
* document your experiments towards the end goal of reaching 'a high accuracy' (what did you try, what worked/did not work),
* document how you use '_generalization_' in your setup (use of a simple hold-out/train-test split, k-fold, etc.),
* produce some sort of '_learning curve_' that illustrates the drop in the cost function, or the increase in the score function, with respect to, say, training iteration (for inspiration see figs. 4.20, 10.12 or 10.17 in [HOML]),
* document the final CNN setup (layers etc., perhaps as a graph/drawing),
* discuss your iterations towards the end goal and other findings you had,
* and, as always, write a conclusion.
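For the introduction item above: a convolution layer slides a small kernel over the input and computes a dot product at every position. A minimal pure-NumPy sketch of a single 'valid' convolution (stride 1, no padding); a real CNN would of course use a framework layer such as Keras' `Conv2D` instead:

``` python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (what DL frameworks call 'convolution')."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the kernel with the patch under it
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

# a vertical-edge detector on a tiny image: left half dark, right half bright
img = np.zeros((5, 5)); img[:, 3:] = 1.0
k = np.array([[-1.0, 1.0]])  # responds where intensity jumps left-to-right
print(conv2d(img, k))        # a column of 1s where the edge sits
```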
%% Cell type:code id: tags:
``` python
# TODO: CNN implementation...
```
%% Cell type:markdown id: tags:
Date | Change
---------|----------
2021-1020| CEF, initial version, clone from [HOML].
2021-1026| CEF, added learning curve item.