Commit 3d77b222 authored by Carsten Eie Frigaard's avatar Carsten Eie Frigaard
Browse files

update

parent 3a90cc65
%% Cell type:markdown id: tags:
# ITMAL Exercise
## The E.T. and Gradient Descent for Linear Regression
<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/et.jpg" alt="WARNING: image could not be fetched" style="height:250px">
The Good News is that aliens are here, yes, really! The Bad News is that You are assigned teacher for one Extra-Terrestial _creature_.
Your task is to create a document (journal) for the _creature_, explaining all about Gradient Descent related to the linear regression model.
The _creature_ needs about four, max. six normal pages of text otherwise it becomes very grumpy like a Gremlin.
However, the _creature_ likes reading code and math and examining beutifull plots, so code sections, math and plots do not count into the max normal-page limit.
As you job of being an E.T.-teacher, You must cover Gradient Decent for a simple Linear Regression model with at least the following concepts:
* Linear Regression model prediction in vectorized form
* MSE cost function for a Linear Regression model
* Closed-form solution (normal equation)
* Numerical gradient decent
* Batch Gradient Descent
* Stochastic Gradient Descent
* Learning rates
Feel free to add additional Gradient Decent concepts, but remember to keep the text You submit below max. six pages (exluding plot, code and math).
Note that you could peek into the other notebooks for this lesson, copying math, code, and plots from these are allowed.
(Once submitted as a hand-in in Brightspace, I will forward it to the E.T., but expect no direct feedback from the _creature_..)
%% Cell type:code id: tags:
``` python
# TODO: Your GD documentation for the E.T.
```
%% Cell type:markdown id: tags:
REVISIONS| |
---------| |
2021-0926| CEF, initial.
......
%% Cell type:markdown id: tags:
# ITMAL Exercise
## Gradient Descent Methods and Training
Finding the optimal solution in one-step, via
$$
\newcommand\rem[1]{}
\rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs}
\newcommand\eq[2]{#1 &=& #2\\}
\newcommand\ar[2]{\begin{array}{#1}#2\end{array}}
\newcommand\ac[2]{\left[\ar{#1}{#2}\right]}
\newcommand\st[1]{_{\mbox{\scriptsize #1}}}
\newcommand\norm[1]{{\cal L}_{#1}}
\newcommand\obs[2]{#1_{\mbox{\scriptsize obs}}^{\left(#2\right)}}
\newcommand\diff[1]{\mbox{d}#1}
\newcommand\pown[1]{^{(#1)}}
\def\pownn{\pown{n}}
\def\powni{\pown{i}}
\def\powtest{\pown{\mbox{\scriptsize test}}}
\def\powtrain{\pown{\mbox{\scriptsize train}}}
\def\bX{\mathbf{M}}
\def\bX{\mathbf{X}}
\def\bZ{\mathbf{Z}}
\def\bw{\mathbf{m}}
\def\bx{\mathbf{x}}
\def\by{\mathbf{y}}
\def\bz{\mathbf{z}}
\def\bw{\mathbf{w}}
\def\btheta{{\boldsymbol\theta}}
\def\bSigma{{\boldsymbol\Sigma}}
\def\half{\frac{1}{2}}
\newcommand\pfrac[2]{\frac{\partial~#1}{\partial~#2}}
\newcommand\dfrac[2]{\frac{\mbox{d}~#1}{\mbox{d}#2}}
\bw^* ~=~ \left( \bX^\top \bX \right)^{-1} \bX^\top \by
$$
has its downsides: the scaling problem of the matrix inverse. Now, let us look at a numerical solution to the problem of finding the value of $\bw$ (aka $\btheta$) that minimizes the objective function $J$.
Again, ideally we just want to find places, where the (multi-dimensionally) gradient of $J$ is zero (here using a constant factor $\frac{2}{m}$)
$$
\ar{rl}{
\nabla_\bw J(\bw) &= \frac{2}{m} \bX^\top \left( \bX \bw - \by \right)\\
}
$$
and numerically we calculate $\nabla_{\bw} J$ for a point in $\bw$-space, and then move along in the opposite direction of this gradient, taking a step of size $\eta$
$$
\bw^{(step~N+1)} = \bw^{(step~N)} - \eta \nabla_{\bw} J(\bw)
$$
That's it, pretty simple, right (apart from numerical stability, problem with convergence and regularization, that we will discuss later).
So, we begin with some initial $\bw$, and iterate via the equation above, towards places, where $J$ is smaller, and this can be illustrated as
<img src="Figs/minimization_gd.png" style="height:240px">
<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization.png" alt="WARNING: image could not be fetched" style="height:240px">
If we hit the/a global minimum or just a local minimum (or in extremely rare cases a local saddle point) is another question when not using a simple linear regression model: for non-linear models we will in general not see a nice convex $J$-$\bw$ surface, as in the figure above.
### Qa The Gradient Descent Method (GD)
Explain the gradient descent algorithm using the equations [HOML] p.114-115. and relate it to the code snippet
```python
X_b, y = GenerateData()
eta = 0.1
n_iterations = 1000
m = 100
theta = np.random.randn(2,1)
for iteration in range(n_iterations):
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients
```
in the python code below.
As usual, avoid going top much into details of the code that does the plotting.
What role does `eta` play, and what happens if you increase/decrease it (explain the three plots)?
%% Cell type:code id: tags:
``` python
# TODO: Qa...examine the method (without the plotting)
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
def GenerateData():
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
return X, X_b, y
X, X_b, y = GenerateData()
eta = 0.1
n_iterations = 1000
m = 100
theta = np.random.randn(2,1)
for iteration in range(n_iterations):
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients
print(f'stochastic gradient descent theta={theta.ravel()}')
##########################################################
# rest of the code is just for plotting, needs no review
def plot_gradient_descent(theta, eta, theta_path=None):
m = len(X_b)
plt.plot(X, y, "b.")
n_iterations = 1000
for iteration in range(n_iterations):
if iteration < 10:
y_predict = X_new_b.dot(theta)
style = "b-" if iteration > 0 else "r--"
plt.plot(X_new, y_predict, style)
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients
if theta_path is not None:
theta_path.append(theta)
plt.xlabel("$x_1$", fontsize=18)
plt.axis([0, 2, 0, 15])
plt.title(r"$\eta = {}$".format(eta), fontsize=16)
np.random.seed(42)
theta_path_bgd = []
theta = np.random.randn(2,1) # random initialization
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)
plt.subplot(133); plot_gradient_descent(theta, eta=0.5)
plt.show()
print('OK')
```
%%%% Output: stream
stochastic gradient descent theta=[4.15435925 2.94177079]
%%%% Output: display_data
![]()
%%%% Output: stream
OK
%% Cell type:markdown id: tags:
### Qb The Stochastic Gradient Descent Method (SGD)
Now, introducing the _stochastic_ variant of gradient descent, explain the stochastic nature of the SGD, and comment on the difference to the _normal_ gradient descent method (GD) we just saw.
Also explain the role of the calls to `np.random.randint()` in the code,
HINT: In detail, the important differences are, that the main loop for SGC is
```python
for epoch in range(n_epochs):
for i in range(m):
.
.
.
gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
eta = ...
theta = ...
```
where it for the GD method was just
```python
for iteration in range(n_iterations):
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = ..
```
NOTE: the call `np.random.seed(42)` resets the random generator so that it produces the same random-sequence when re-running the code.
%% Cell type:code id: tags:
``` python
# TODO: Qb...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
theta_path_sgd = []
m = len(X_b)
np.random.seed(42)
n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters
def learning_schedule(t):
return t0 / (t + t1)
theta = np.random.randn(2,1) # random initialization
for epoch in range(n_epochs):
for i in range(m):
if epoch == 0 and i < 20:
y_predict = X_new_b.dot(theta)
style = "b-" if i > 0 else "r--"
plt.plot(X_new, y_predict, style)
random_index = np.random.randint(m)
xi = X_b[random_index:random_index+1]
yi = y[random_index:random_index+1]
gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(epoch * m + i)
theta = theta - eta * gradients
theta_path_sgd.append(theta)
plt.plot(X, y, "b.")
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=50, tol=-np.infty, penalty=None, eta0=0.1, random_state=42)
sgd_reg.fit(X, y.ravel())
print(f'stochastic gradient descent theta={theta.ravel()}')
print(f'Scikit-learn SGDRegressor "thetas": sgd_reg.intercept_={sgd_reg.intercept_}, sgd_reg.coef_={sgd_reg.coef_}')
##########################################################
# rest of the code is just for plotting, needs no review
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()
print('OK')
```
%%%% Output: stream
stochastic gradient descent theta=[3.84208754 3.03831083]
Scikit-learn SGDRegressor "thetas": sgd_reg.intercept_=[3.83534891], sgd_reg.coef_=[2.96948592]
%%%% Output: display_data
![]()
%%%% Output: stream
OK
%% Cell type:markdown id: tags:
### Qc Adaptive learning rate for $\eta$
There is also an adaptive learning rate method in the demo code for the SGD.
Explain the effects of the `learning_schedule()` functions.
You can set the learning rate parameter (also known as a hyperparameter) in may ML algorithms, say for SGD regression, to a method of your choice
```python
SGDRegressor(max_iter=1,
eta0=0.0005,
learning_rate="constant", # or 'adaptive' etc.
random_state=42)
```
but as usual, there is a bewildering array of possibilities...we will tackle this problem later when searching for the optimal hyperparameters.
NOTE: the `learning_schedule()` method could also have been used in the normal SG algorithm; is not directly part of the stochastic method, but a concept in itself.
%% Cell type:code id: tags:
``` python
# TODO: Qc...in text
```
%% Cell type:markdown id: tags:
### Qd Mini-batch Gradient Descent Method
Finally explain what a __mini-batch__ SG method is, and how it differs from the two others.
Again, take a peek into the demo code below, to extract the algorithm details...and explain the __main differences__, compared with the GD and SGD.
%% Cell type:code id: tags:
``` python
# TODO: Qd...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
theta_path_mgd = []
n_iterations = 50
minibatch_size = 20
np.random.seed(42)
theta = np.random.randn(2,1) # random initialization
t0, t1 = 200, 1000
def learning_schedule(t):
return t0 / (t + t1)
t = 0
for epoch in range(n_iterations):
shuffled_indices = np.random.permutation(m)
X_b_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for i in range(0, m, minibatch_size):
t += 1
xi = X_b_shuffled[i:i+minibatch_size]
yi = y_shuffled[i:i+minibatch_size]
gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(t)
theta = theta - eta * gradients
theta_path_mgd.append(theta)
print(f'mini-batch theta={theta.ravel()}')
print('OK')
```
%%%% Output: stream
mini-batch theta=[3.86814831 3.02878845]
OK
%% Cell type:markdown id: tags:
### Qe Choosing a Gradient Descent Method
Explain the $θ_0−θ_1$ plot below, and make a comment on when to use GD/SGD/mini-batch gradient descent (pros and cons for the different methods).
%% Cell type:code id: tags:
``` python
# TODO: Qd...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)
plt.figure(figsize=(7,4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()
print('OK')
```
%%%% Output: display_data
![]()
%%%% Output: stream
OK
%% Cell type:markdown id: tags:
### [OPTIONAL] Qf Extend the MyRegressor
Can you extend the `MyRegressor` class from the previous notebook, adding a numerical train method? Choose one of the gradient descent methods above...perhaps starting with a plain SG method.
You could add a parameter for the class, indicating it what mode it should be operating: analytical closed-form or numerical, like
```python
class MyRegressor(BaseEstimator):
def __init__(self, numerical = False):
self.__w = None
self.__numerical_mode = numerical
.
.
.
```
%% Cell type:code id: tags:
``` python
# TODO: Qf...[OPTIONAL]
```
%% Cell type:markdown id: tags:
REVISIONS| |
---------| |
2018-0214| CEF, initial.
2018-0214| CEF, added optional exe.
2018-0220| CEF, major update.
2018-0220| CEF, fixed revision table malformatting.
2018-0225| CEF, removed =0 in expression.
2018-0225| CEF, minor text updates and made Qc optional.
2018-0225| CEF, minor source code cleanups.
2021-0918| CEF, update to ITMAL E21.
......
%% Cell type:markdown id: tags:
# ITMAL Exercise
REVISIONS|
---------|------------------------------------------------
2018-1218|CEF, initial.
2018-0214|CEF, major update.
2018-0218|CEF, fixed error in nabla expression.
2018-0218|CEF, added minimization plot.
2018-0218|CEF, added note on argmin/max.
2018-0218|CEF, changed concave to convex.
## Training a Linear Regressor I
The goal of the linear regression is to find the argument $w$ that minimizes the sum-of-squares error over all inputs.
Given the usual ML input data matrix $\mathbf X$ of size $(n,d)$ where each row is an input column vector $(\mathbf{x}^{(i)})^\top$ data sample of size $d$
$$
\newcommand\rem[1]{}
\rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs}
\newcommand\eq[2]{#1 &=& #2\\}
\newcommand\ar[2]{\begin{array}{#1}#2\end{array}}
\newcommand\ac[2]{\left[\ar{#1}{#2}\right]}
\newcommand\st[1]{_{\mbox{\scriptsize #1}}}
\newcommand\norm[1]{{\cal L}_{#1}}
\newcommand\obs[2]{#1_{\mbox{\scriptsize obs}}^{\left(#2\right)}}
\newcommand\diff[1]{\mbox{d}#1}
\newcommand\pown[1]{^{(#1)}}
\def\pownn{\pown{n}}
\def\powni{\pown{i}}
\def\powtest{\pown{\mbox{\scriptsize test}}}
\def\powtrain{\pown{\mbox{\scriptsize train}}}
\def\bX{\mathbf{M}}
\def\bX{\mathbf{X}}
\def\bZ{\mathbf{Z}}
\def\bw{\mathbf{m}}
\def\bx{\mathbf{x}}
\def\by{\mathbf{y}}
\def\bz{\mathbf{z}}
\def\bw{\mathbf{w}}
\def\btheta{{\boldsymbol\theta}}
\def\bSigma{{\boldsymbol\Sigma}}
\def\half{\frac{1}{2}}
\newcommand\pfrac[2]{\frac{\partial~#1}{\partial~#2}}
\newcommand\dfrac[2]{\frac{\mbox{d}~#1}{\mbox{d}#2}}
\bX =
\ac{cccc}{
x_1\pown{1} & x_2\pown{1} & \cdots & x_d\pown{1} \\
x_1\pown{2} & x_2\pown{2} & \cdots & x_d\pown{2}\\
\vdots & & & \vdots \\
x_1\pownn & x_2\pownn & \cdots & x_d\pownn\\
}
$$
and $\by$ is the target output column vector of size $n$
$$
\by =
\ac{c}{
y\pown{1} \\
y\pown{2} \\
\vdots \\
y\pown{n} \\
}
$$
The linear regression model, via its hypothesis function and for a column vector input $\bx\powni$ of size $d$ and a column weight vector $\bw$ of size $d+1$ (with the additional element $w_0$ being the bias), can now be written as simple as
$$
\ar{rl}{
h(\bx\powni;\bw) &= \bw^\top \ac{c}{1\\\bx\powni} \\
&= w_0 + w_1 x_1\powni + w_2 x_2\powni + \cdots + w_d x_d\powni
}
$$
using the model parameters or weights, $\bw$, aka $\btheta$. To ease notation $\bx$ is assumed to have the 1 element prepended in the following so that $\bx$ is a $d+1$ column vector
$$
\ar{rl}{
\ac{c}{1\\\bx\powni} &\mapsto \bx\powni, ~~~~\mbox{by convention in the following...}\\
h(\bx\powni;\bw) &= \bw^\top \bx\powni
}
$$
This is actually the first fully white-box machine learning algorithm, that we see. All the glory details of the algorithm are clearly visible in the internal vector multiplication...quite simple, right? Now we just need to train the weights...
### Loss or Objective Function - Formulation for Linear Regression
The individual cost (or loss), $L\powni$, for a single input-vector $\bx\powni$ is a measure of how the model is able to fit the data: the higher the $L\powni$ value the worse it is able to fit. A loss of $L=0$ means a perfect fit.
It can be given by, say, the square difference from the calculated output, $h$, to the desired output, $y$
$$
\ar{rl}{
L\powni &= \left( h(\bx\powni;\bw) - y\powni \right)^2\\
&= \left( \bw^\top\bx\powni - y\powni \right)^2
}