## The E.T. and Gradient Descent for Linear Regression
<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/et.jpg" alt="WARNING: image could not be fetched" style="height:250px">
The Good News is that aliens are here, yes, really! The Bad News is that you are assigned as teacher for one Extra-Terrestrial _creature_.
Your task is to create a document (journal) for the _creature_, explaining all about Gradient Descent related to the linear regression model.
The _creature_ needs about four, max. six normal pages of text, otherwise it becomes very grumpy, like a Gremlin.
However, the _creature_ likes reading code and math and examining beautiful plots, so code sections, math and plots do not count towards the max normal-page limit.
In your job as an E.T. teacher, you must cover Gradient Descent for a simple Linear Regression model, including at least the following concepts:
* Linear Regression model prediction in vectorized form
* MSE cost function for a Linear Regression model
* Closed-form solution (normal equation)
* Numerical gradient descent
* Batch Gradient Descent
* Stochastic Gradient Descent
* Learning rates
Feel free to add additional Gradient Descent concepts, but remember to keep the text you submit below max. six pages (excluding plots, code and math).
Note that you may peek into the other notebooks for this lesson; copying math, code, and plots from these is allowed.
(Once submitted as a hand-in in Brightspace, I will forward it to the E.T., but expect no direct feedback from the _creature_..)
The closed-form solution has its downsides: the scaling problem of the matrix inverse. Now, let us look at a numerical solution to the problem of finding the value of $\bw$ (aka $\btheta$) that minimizes the objective function $J$.
Again, ideally we just want to find places where the (multi-dimensional) gradient of $J$ is zero (here using a constant factor $\frac{2}{m}$), and numerically we calculate $\nabla_{\bw} J$ at a point in $\bw$-space and then move in the opposite direction of this gradient, taking a step of size $\eta$.
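The two ingredients can be written out explicitly: the gradient of the MSE cost $J$ (with the $\frac{2}{m}$ factor mentioned above, matching the code further below) and the update step of size $\eta$

$$
\nabla_{\bw} J(\bw) = \frac{2}{m} \mathbf{X}^\top (\mathbf{X} \bw - \by),
~~~~
\bw \leftarrow \bw - \eta \, \nabla_{\bw} J(\bw)
$$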
<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization.png" alt="WARNING: image could not be fetched" style="height:240px">
Whether we hit the/a global minimum, just a local minimum, or (in extremely rare cases) a saddle point is another question when not using a simple linear regression model: for non-linear models we will in general not see a nice convex $J$-$\bw$ surface as in the figure above.
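Numerical gradient descent can also be sanity-checked against the analytic MSE gradient. Here is a minimal sketch (the data generation mirrors the demo code below; the step size `h` is my own choice) comparing the analytic expression with a central finite-difference approximation of $\nabla_{\bw} J$:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # prepend the bias column x0 = 1
m = 100

def J(theta):
    """MSE cost for a weight vector theta of shape (2, 1)."""
    err = X_b.dot(theta) - y
    return float((err ** 2).mean())

def numerical_gradient(theta, h=1e-6):
    """Central finite differences: (J(theta+h*e_j) - J(theta-h*e_j)) / (2h) per component."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * h)
    return grad

theta = np.random.randn(2, 1)
analytic = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
numeric = numerical_gradient(theta)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

The finite-difference version needs one cost evaluation per dimension per step, which is why the analytic gradient is preferred in practice.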
### Qa The Gradient Descent Method (GD)
Explain the gradient descent algorithm using the equations in [HOML] pp. 114-115, and relate it to the code snippet
```python
X_b, y = GenerateData()

eta = 0.1
n_iterations = 1000
m = 100

theta = np.random.randn(2, 1)

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```
in the Python code below.
As usual, avoid going too much into the details of the code that does the plotting.
What role does `eta` play, and what happens if you increase/decrease it (explain the three plots)?
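As a hedged sketch of what the three plots illustrate, one could run the batch GD loop for a few different `eta` values and compare the resulting fit (these particular values and the fixed zero start are my own choices, not necessarily those of the demo):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = 100

def batch_gd(eta, n_iterations):
    theta = np.zeros((2, 1))  # fixed start to make the runs comparable
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    return theta

# too small: slow convergence; reasonable: fast convergence;
# too large: overshooting, possibly divergence
for eta in (0.02, 0.1, 0.5):
    theta = batch_gd(eta, 100)
    mse = float(((X_b.dot(theta) - y) ** 2).mean())
    print(f"eta={eta}: theta={theta.ravel()} MSE={mse:.3g}")
```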
``` python
# TODO: Qa...examine the method (without the plotting)
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

def GenerateData():
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)
    X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
    return X_b, y
```
### Qb The Stochastic Gradient Descent Method (SGD)
Now, introducing the _stochastic_ variant of gradient descent: explain the stochastic nature of SGD, and comment on the difference to the _normal_ gradient descent method (GD) we just saw.
Also explain the role of the calls to `np.random.randint()` in the code.
HINT: In detail, the important difference is that the main loop for SGD is
```python
for epoch in range(n_epochs):
    for i in range(m):
        .
        .
        .
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = ...
        theta = ...
```
where for the GD method it was just
```python
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = ..
```
NOTE: the call `np.random.seed(42)` resets the random generator so that it produces the same random-sequence when re-running the code.
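A complete sketch of the SGD loop, along the lines of [GITHOML] (the epoch count and the schedule constants `t0`, `t1` are my own assumptions, not necessarily the demo's values):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = 100

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters (assumed values)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)           # pick ONE random instance
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # no 1/m: single sample
        eta = learning_schedule(epoch * m + i)        # decaying learning rate
        theta = theta - eta * gradients
print(theta.ravel())
```

Note how each step uses just one instance, so each gradient is a noisy estimate of the full gradient; the decaying `eta` is what lets the iterates settle down despite that noise.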
``` python
# TODO: Qb...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
```
### Qc Learning Rate Schedules

There is also an adaptive learning rate method in the demo code for the SGD.
Explain the effect of the `learning_schedule()` function.
You can set the learning rate parameter (also known as a hyperparameter) in many ML algorithms, say for SGD regression, to a method of your choice
```python
SGDRegressor(max_iter=1,
             eta0=0.0005,
             learning_rate="constant",  # or 'adaptive' etc.
             random_state=42)
```
but as usual, there is a bewildering array of possibilities...we will tackle this problem later when searching for the optimal hyperparameters.
NOTE: the `learning_schedule()` method could also have been used in the normal GD algorithm; it is not directly part of the stochastic method, but a concept in itself.
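For reference, a minimal end-to-end fit with Scikit-learn's `SGDRegressor` might look like this (the `eta0` value and the schedule choice are my own, not prescribed by the exercise):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()  # SGDRegressor expects a 1-D target

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3,
                       eta0=0.1, learning_rate="invscaling",  # eta decays as eta0 / t^0.25
                       random_state=42)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)  # should land near the true (4, 3)
```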
``` python
# TODO: Qc...in text
```
### Qd Mini-batch Gradient Descent Method
Finally, explain what a __mini-batch__ gradient descent method is, and how it differs from the two others.
Again, take a peek into the demo code below to extract the algorithm details...and explain the __main differences__ compared with GD and SGD.
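A hedged sketch of the mini-batch variant (the batch size and schedule constants are my own assumptions): instead of one instance (SGD) or all $m$ instances (GD), each step uses a small random subset:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = 100

n_epochs = 50
minibatch_size = 20
t0, t1 = 200, 1000  # schedule constants (assumed values)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
t = 0
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)   # reshuffle the data every epoch
    X_s, y_s = X_b[shuffled], y[shuffled]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_s[i:i + minibatch_size]    # a mini-batch of 20 instances
        yi = y_s[i:i + minibatch_size]
        gradients = 2 / minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - learning_schedule(t) * gradients
print(theta.ravel())
```

The gradient noise is averaged over the mini-batch, so the path is smoother than SGD's but still cheaper per step than full GD, and the vectorized per-batch product exploits hardware well.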
``` python
# TODO: Qd...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
```
Can you extend the `MyRegressor` class from the previous notebook, adding a numerical train method? Choose one of the gradient descent methods above...perhaps starting with a plain SGD method.
You could add a parameter to the class, indicating in what mode it should operate: analytical closed-form or numerical, like
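Since the previous notebook's `MyRegressor` is not reproduced here, the following is a hypothetical sketch of what such a dual-mode class could look like (the class layout, the `mode` parameter name, and its values are all my own invention; here batch GD stands in for the numerical method):

```python
import numpy as np

class MyRegressor:  # hypothetical sketch, not the class from the previous notebook
    def __init__(self, mode="analytical", eta=0.1, n_iterations=1000):
        assert mode in ("analytical", "numerical")
        self.mode = mode
        self.eta = eta
        self.n_iterations = n_iterations

    def fit(self, X_b, y):
        if self.mode == "analytical":
            # closed-form normal equation
            self.theta_ = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
        else:
            # numerical: plain batch gradient descent
            m = X_b.shape[0]
            self.theta_ = np.zeros((X_b.shape[1], 1))
            for _ in range(self.n_iterations):
                gradients = 2 / m * X_b.T.dot(X_b.dot(self.theta_) - y)
                self.theta_ = self.theta_ - self.eta * gradients
        return self

    def predict(self, X_b):
        return X_b.dot(self.theta_)

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
t_closed = MyRegressor(mode="analytical").fit(X_b, y).theta_
t_gd = MyRegressor(mode="numerical").fit(X_b, y).theta_
print(t_closed.ravel(), t_gd.ravel())  # the two modes should agree closely
```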
The goal of linear regression is to find the weight vector $\bw$ that minimizes the sum-of-squares error over all inputs.
Given the usual ML input data matrix $\mathbf X$ of size $(n,d)$, where each row is a transposed input sample $(\mathbf{x}^{(i)})^\top$ of size $d$
$$
\mathbf X =
\ac{c}{
    (\bx\pown{1})^\top \\
    (\bx\pown{2})^\top \\
    \vdots \\
    (\bx\pown{n})^\top \\
}
$$
and $\by$ is the target output column vector of size $n$
$$
\by =
\ac{c}{
y\pown{1} \\
y\pown{2} \\
\vdots \\
y\pown{n} \\
}
$$
The linear regression model, via its hypothesis function and for a column input vector $\bx\powni$ of size $d$ and a column weight vector $\bw$ of size $d+1$ (with the additional element $w_0$ being the bias), can now be written quite simply using the model parameters or weights, $\bw$, aka $\btheta$. To ease notation, $\bx$ is assumed to have the 1-element prepended in the following, so that $\bx$ becomes a $d+1$ column vector
$$
\ar{rl}{
\ac{c}{1\\\bx\powni} &\mapsto \bx\powni, ~~~~\mbox{by convention in the following...}\\
h(\bx\powni;\bw) &= \bw^\top \bx\powni
}
$$
This is actually the first fully white-box machine learning algorithm that we see. All the gory details of the algorithm are clearly visible in the internal vector multiplication...quite simple, right? Now we just need to train the weights...
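As a tiny concrete check of $h(\bx;\bw) = \bw^\top \bx$ in vectorized form (the toy numbers below are my own choice):

```python
import numpy as np

# toy weights: bias w0 = 4, slope w1 = 3
w = np.array([[4.0], [3.0]])

# three samples with the 1-element already prepended (x = 0, 1, 2)
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])

predictions = X_b.dot(w)    # one matrix product predicts all samples at once
print(predictions.ravel())  # → [ 4.  7. 10.]
```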
### Loss or Objective Function - Formulation for Linear Regression
The individual cost (or loss), $L\powni$, for a single input vector $\bx\powni$ is a measure of how well the model fits the data: the higher the $L\powni$ value, the worse the fit. A loss of $L=0$ means a perfect fit.
It can be given by, say, the squared difference between the calculated output, $h$, and the desired output, $y$