## The E.T. and Gradient Descent for Linear Regression
<imgsrc="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/et.jpg"alt="WARNING: image could not be fetched"style="height:250px">
The Good News is that aliens are here, yes, really! The Bad News is that You are assigned as teacher for one Extra-Terrestrial _creature_.
Your task is to create a document (journal) for the _creature_, explaining all about Gradient Descent related to the linear regression model.
Other students and courses will be teaching matrix/vector notation, linear algebra, art, poetry, politics, war, economics and explaining the humor in The Simpsons, etc. You should concentrate on teaching the _creature_ all about Gradient Descent for Linear Regression!
The _creature_ needs about four, max. six normal pages of text (one normal page = 2400 chars including spaces). More than six pages is like pouring water on a <a href='https://en.wikipedia.org/wiki/Gremlins'>Mogwai</a> (=> turns into a grumpy Gremlin).
However, the _creature_ likes reading code and math and examining beautiful plots, so code sections, math and plots do not count towards the max normal-page limit. It is fluent in Danish as well as English.
As part of your job as E.T.-teacher, You must cover Gradient Descent for a simple Linear Regression model with at least the following concepts (a short recap of the first two is given right after the list):
* Linear Regression model prediction in vectorized form
* MSE cost function for a Linear Regression model
* Closed-form solution (normal equation)
* Numerical gradient descent
* Batch Gradient Descent
* Stochastic Gradient Descent
* Learning rates
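As a compact recap of the first two items (a sketch in the $\bw$ notation used later in this notebook, where $\mathbf{X}$ includes a bias column $x_0 = 1$):

$$
\hat{\mathbf{y}} = \mathbf{X}\bw,
\qquad
J(\bw) = \mathrm{MSE}(\bw) = \frac{1}{m}\left\|\mathbf{X}\bw - \mathbf{y}\right\|_2^2
$$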
Feel free to add additional Gradient Descent concepts, but remember to keep the text You submit below max. six pages (excluding plots, code and math).
Note that you could peek into the other notebooks for this lesson; copying math, code, and plots from these is allowed.
(Once submitted as a hand-in in Brightspace, I will forward it to the E.T., but expect no direct feedback from the _creature_...)
%% Cell type:code id: tags:
``` python
# TODO: Your GD documentation for the E.T.
```
%% Cell type:markdown id: tags:
REVISIONS| |
---------|---------|
2021-09-26| CEF, initial.
2021-09-27| CEF, elaborated on the story-telling element.
The closed-form solution (normal equation), however, has its downsides: the scaling problem of the matrix inverse. Now, let us look at a numerical solution to the problem of finding the value of $\bw$ (aka $\btheta$) that minimizes the objective function $J$.
Again, ideally we just want to find places where the (multi-dimensional) gradient of $J$ is zero (here using a constant factor $\frac{2}{m}$).
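Written out for the MSE cost of the linear model, this condition reads (a sketch in the notation above):

$$
\nabla_{\bw} J(\bw) = \frac{2}{m}\,\mathbf{X}^\top\left(\mathbf{X}\bw - \mathbf{y}\right) = \mathbf{0}
$$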
Numerically, we calculate $\nabla_{\bw} J$ at a point in $\bw$-space and then move in the opposite direction of this gradient, taking a step of size $\eta$.
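In equation form, one such step is

$$
\bw^{(\mathrm{next\ step})} = \bw - \eta\, \nabla_{\bw} J(\bw)
$$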
That's it, pretty simple, right? (Apart from numerical stability, problems with convergence and regularization, which we will discuss later.)
So, we begin with some initial $\bw$ and iterate via the equation above towards places where $J$ is smaller. This can be illustrated as
<imgsrc="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization.png"alt="WARNING: image could not be fetched"style="height:240px">
<imgsrc="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization_gd.png"alt="WARNING: image could not be fetched"style="height:240px">
Whether we hit the/a global minimum, just a local minimum, or (in extremely rare cases) a saddle point is another question when not using a simple linear regression model: for non-linear models we will in general not see a nice convex $J$-$\bw$ surface, as in the figure above.
### Qa The Gradient Descent Method (GD)
Explain the gradient descent algorithm using the equations in [HOML] pp. 114-115, and relate it to the code snippet
```python
X_b, y = GenerateData()

eta = 0.1
n_iterations = 1000
m = 100
theta = np.random.randn(2, 1)

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```
in the Python code below.
As usual, avoid going too much into the details of the code that does the plotting.
What role does `eta` play, and what happens if you increase/decrease it (explain the three plots)?
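If You want to experiment before examining the plots, a minimal sketch of such an `eta` sweep could look like the following (the concrete `eta` values and the data generation here are illustrative assumptions, not necessarily the exact ones used in the demo/plotting code below):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]   # add x0 = 1 to each instance
m = len(X_b)

for eta in (0.02, 0.1, 0.5):        # small, medium and (too) large step size
    theta = np.random.randn(2, 1)   # fresh random start for each eta
    for iteration in range(1000):
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    # a small eta converges slowly, a too large eta makes theta diverge (values explode)
    print(f"eta={eta}: theta = {theta.ravel()}")
```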
%% Cell type:code id: tags:
``` python
# TODO: Qa...examine the method (without the plotting)
# NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LinearRegression

def GenerateData():
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)
    X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
    return X_b, y
```

%% Cell type:markdown id: tags:
### Qb The Stochastic Gradient Descent Method (SGD)
Now, introducing the _stochastic_ variant of gradient descent, explain the stochastic nature of the SGD, and comment on the difference to the _normal_ gradient descent method (GD) we just saw.
Also explain the role of the calls to `np.random.randint()` in the code.
HINT: In detail, the important differences are that the main loop for SGD is
```python
for epoch in range(n_epochs):
    for i in range(m):
        .
        .
        .
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = ...
        theta = ...
```
whereas for the GD method it was just
```python
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = ..
```
NOTE: the call `np.random.seed(42)` resets the random generator so that it produces the same random sequence when re-running the code.
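As a warm-up before the demo cell, here is a compact, self-contained sketch of such a stochastic loop (the schedule constants `t0`, `t1` and `n_epochs` are illustrative assumptions; the demo code below may use other values):

```python
import numpy as np

np.random.seed(42)                        # reproducible random sequence
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50                            # assumed learning-schedule constants

def learning_schedule(t):
    return t0 / (t + t1)                  # eta decays as training progresses

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)           # pick ONE random instance
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient on that single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print("theta =", theta.ravel())
```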
%% Cell type:code id: tags:
``` python
# TODO: Qb...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
```

%% Cell type:markdown id: tags:

### Qc Learning Rates
There is also an adaptive learning rate method in the demo code for the SGD.
Explain the effect of the `learning_schedule()` function.
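To see the effect of such a schedule in isolation, here is a tiny sketch (the constants `t0` and `t1` are illustrative assumptions):

```python
t0, t1 = 5, 50

def learning_schedule(t):
    return t0 / (t + t1)

# eta starts at 0.1 and decays towards zero as the step counter t grows
print([round(learning_schedule(t), 4) for t in (0, 100, 1000, 5000)])
```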
You can set the learning rate parameter (also known as a hyperparameter) in many ML algorithms, say for SGD regression, to a method of your choice
```python
SGDRegressor(max_iter=1,
             eta0=0.0005,
             learning_rate="constant",  # or 'adaptive' etc.
             random_state=42)
```
but as usual, there is a bewildering array of possibilities...we will tackle this problem later when searching for the optimal hyperparameters.
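If You want to try it directly, a minimal, self-contained usage sketch could be (the `max_iter` value and the data generation are illustrative assumptions; note that `SGDRegressor` adds the intercept itself and expects a 1-D target):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()  # 1-D target for sklearn

sgd_reg = SGDRegressor(max_iter=50, eta0=0.0005,
                       learning_rate="constant", random_state=42)
sgd_reg.fit(X, y)
# fitted intercept and slope (convergence is slow with such a small constant eta)
print(sgd_reg.intercept_, sgd_reg.coef_)
```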
NOTE: the `learning_schedule()` method could also have been used in the normal GD algorithm; it is not directly part of the stochastic method, but a concept in itself.
%% Cell type:code id: tags:
``` python
# TODO: Qc...in text
```
%% Cell type:markdown id: tags:
### Qd Mini-batch Gradient Descent Method
Finally, explain what a __mini-batch__ SG method is, and how it differs from the two other methods.
Again, take a peek into the demo code below, to extract the algorithm details...and explain the __main differences__, compared with the GD and SGD.
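As a reference point, a compact, self-contained sketch of the core mini-batch loop could look like this (the `minibatch_size`, `n_epochs` and fixed `eta` are illustrative assumptions; the demo code below is the authoritative version):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
minibatch_size = 20                     # assumed batch size
eta = 0.1                               # fixed learning rate, for simplicity
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)    # new random order each epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        xi = X_b_shuffled[i:i + minibatch_size]    # a small batch, not a single instance
        yi = y_shuffled[i:i + minibatch_size]
        gradients = 2 / minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients

print("theta =", theta.ravel())
```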
%% Cell type:code id: tags:
``` python
# TODO: Qd...run this code
# NOTE: code from [GITHOML], 04_training_linear_models.ipynb
```

%% Cell type:markdown id: tags:
Can you extend the `MyRegressor` class from the previous `linear_regression_2.ipynb` notebook, adding a numerical train method? Choose one of the gradient descent methods above...perhaps starting with a plain SG method.

NOTE: this exercise is only possible if `linear_regression_2.ipynb` has been solved.
You could add a parameter for the class, indicating in what mode it should be operating: analytical closed-form or numerical, like
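a mode flag along these lines (a purely hypothetical sketch; the parameter name `mode` and its values are not prescribed by the exercise, and the real class structure is given in the diagram and demo code below):

```python
class MyRegressor:
    # hypothetical sketch: a flag selecting analytical or numerical fitting
    def __init__(self, mode="closed_form"):
        assert mode in ("closed_form", "sgd")
        self.mode = mode

    def fit(self, X, y):
        if self.mode == "closed_form":
            ...  # solve via the normal equation
        else:
            ...  # iterate via (stochastic) gradient descent
        return self
```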
Recall from that notebook that we fitted the linear model $\mathbf{y} \approx \mathbf{X}\bw$ for some input data, using the $\norm{2}^2$ or MSE internally in the loss function.
To solve this equation in closed form (directly, without any numerical approximation), we found the optimal solution to be the rather elegant least-squares solution,
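i.e. the normal equation (restated here in the notation used above):

$$
\hat{\bw} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}
$$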
<imgsrc="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/class_regression.png"alt="WARNING: image could not be fetched"style="height:350px">
Finalize or complete the `fit()`, `predict()` and `mse()` functions in the `MyRegressor` class; there is already a good structure for it in the code below.
Also, notice the implementation of the `score()` function in the class, which is similar to `sklearn.linear_model.LinearRegression`'s score function, i.e. the $R^2$ score we saw in an earlier lesson.
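For reference, the standard definition of the $R^2$ score is

$$
R^2 = 1 - \frac{\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i}\left(y^{(i)} - \bar{y}\right)^2}
$$

where $\bar{y}$ is the mean of the target values.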
Use the test stub below, that creates some simple test data, similar to [HOML] p.108/110, and runs a fit-predict on your brand new linear regression closed-form estimator.