Commit 98167d98 by Carsten Eie Frigaard

### update

parent 3d77b222

7.98 KB | W: | H:

34.1 KB | W: | H:

• 2-up
• Swipe
• Onion skin
 %% Cell type:markdown id: tags: # ITMAL Exercise ## The E.T. and Gradient Descent for Linear Regression The Good News is that aliens are here, yes, really! The Bad News is that You are assigned teacher for one Extra-Terrestial _creature_. The Good News is that aliens are here, yes, really, and You are are assigned teacher for one Extra-Terrestial _creature_. Your task is to create a document (journal) for the _creature_, explaining all about Gradient Descent related to the linear regression model. The _creature_ needs about four, max. six normal pages of text otherwise it becomes very grumpy like a Gremlin. Other students and causes will be teaching matrix/vector notation, linear algebra, art, poetry, politics, war, economics and explaining the humor in The Simpson's etc. You should concentrate on learning the _creature_ all about Gradient Decent for Linear Regression! However, the _creature_ likes reading code and math and examining beutifull plots, so code sections, math and plots do not count into the max normal-page limit. The _creature_ needs about four, max. six normal pages of text (one normal page = 2400 chars including spaces). More than six pages is like pouring water on a Mogwai (=> turns into a grumpy Gremlin). However, the _creature_ likes reading code and math and examining beutifull plots, so code sections, math and plots do not count into the max normal-page limit. It is fluent in Danish as well as English. As you job of being an E.T.-teacher, You must cover Gradient Decent for a simple Linear Regression model with at least the following concepts: * Linear Regression model prediction in vectorized form * MSE cost function for a Linear Regression model * Closed-form solution (normal equation) * Numerical gradient decent * Batch Gradient Descent * Stochastic Gradient Descent * Learning rates Feel free to add additional Gradient Decent concepts, but remember to keep the text You submit below max. six pages (exluding plot, code and math). Note that you could peek into the other notebooks for this lesson, copying math, code, and plots from these are allowed. (Once submitted as a hand-in in Brightspace, I will forward it to the E.T., but expect no direct feedback from the _creature_..) %% Cell type:code id: tags:  python # TODO: Your GD documentation for the E.T.  %% Cell type:markdown id: tags: REVISIONS| | ---------| | 2021-0926| CEF, initial. 2021-0927| CEF, elaborated on the story-telling element. ... ...
 %% Cell type:markdown id: tags: # ITMAL Exercise ## Gradient Descent Methods and Training Finding the optimal solution in one-step, via $$\newcommand\rem[1]{} \rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs} \newcommand\eq[2]{#1 &=& #2\\} \newcommand\ar[2]{\begin{array}{#1}#2\end{array}} \newcommand\ac[2]{\left[\ar{#1}{#2}\right]} \newcommand\st[1]{_{\mbox{\scriptsize #1}}} \newcommand\norm[1]{{\cal L}_{#1}} \newcommand\obs[2]{#1_{\mbox{\scriptsize obs}}^{\left(#2\right)}} \newcommand\diff[1]{\mbox{d}#1} \newcommand\pown[1]{^{(#1)}} \def\pownn{\pown{n}} \def\powni{\pown{i}} \def\powtest{\pown{\mbox{\scriptsize test}}} \def\powtrain{\pown{\mbox{\scriptsize train}}} \def\bX{\mathbf{M}} \def\bX{\mathbf{X}} \def\bZ{\mathbf{Z}} \def\bw{\mathbf{m}} \def\bx{\mathbf{x}} \def\by{\mathbf{y}} \def\bz{\mathbf{z}} \def\bw{\mathbf{w}} \def\btheta{{\boldsymbol\theta}} \def\bSigma{{\boldsymbol\Sigma}} \def\half{\frac{1}{2}} \newcommand\pfrac[2]{\frac{\partial~#1}{\partial~#2}} \newcommand\dfrac[2]{\frac{\mbox{d}~#1}{\mbox{d}#2}} \bw^* ~=~ \left( \bX^\top \bX \right)^{-1} \bX^\top \by$$ has its downsides: the scaling problem of the matrix inverse. Now, let us look at a numerical solution to the problem of finding the value of $\bw$ (aka $\btheta$) that minimizes the objective function $J$. Again, ideally we just want to find places, where the (multi-dimensionally) gradient of $J$ is zero (here using a constant factor $\frac{2}{m}$) $$\ar{rl}{ \nabla_\bw J(\bw) &= \frac{2}{m} \bX^\top \left( \bX \bw - \by \right)\\ }$$ and numerically we calculate $\nabla_{\bw} J$ for a point in $\bw$-space, and then move along in the opposite direction of this gradient, taking a step of size $\eta$ $$\bw^{(step~N+1)} = \bw^{(step~N)} - \eta \nabla_{\bw} J(\bw)$$ That's it, pretty simple, right (apart from numerical stability, problem with convergence and regularization, that we will discuss later). So, we begin with some initial $\bw$, and iterate via the equation above, towards places, where $J$ is smaller, and this can be illustrated as If we hit the/a global minimum or just a local minimum (or in extremely rare cases a local saddle point) is another question when not using a simple linear regression model: for non-linear models we will in general not see a nice convex $J$-$\bw$ surface, as in the figure above. ### Qa The Gradient Descent Method (GD) Explain the gradient descent algorithm using the equations [HOML] p.114-115. and relate it to the code snippet python X_b, y = GenerateData() eta = 0.1 n_iterations = 1000 m = 100 theta = np.random.randn(2,1) for iteration in range(n_iterations): gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) theta = theta - eta * gradients  in the python code below. As usual, avoid going top much into details of the code that does the plotting. What role does eta play, and what happens if you increase/decrease it (explain the three plots)? %% Cell type:code id: tags:  python # TODO: Qa...examine the method (without the plotting) # NOTE: modified code from [GITHOML], 04_training_linear_models.ipynb %matplotlib inline import matplotlib as mpl import matplotlib.pyplot as plt import numpy as np from sklearn.linear_model import LinearRegression def GenerateData(): X = 2 * np.random.rand(100, 1) y = 4 + 3 * X + np.random.randn(100, 1) X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance return X, X_b, y X, X_b, y = GenerateData() eta = 0.1 n_iterations = 1000 m = 100 theta = np.random.randn(2,1) for iteration in range(n_iterations): gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) theta = theta - eta * gradients print(f'stochastic gradient descent theta={theta.ravel()}') ########################################################## # rest of the code is just for plotting, needs no review def plot_gradient_descent(theta, eta, theta_path=None): m = len(X_b) plt.plot(X, y, "b.") n_iterations = 1000 for iteration in range(n_iterations): if iteration < 10: y_predict = X_new_b.dot(theta) style = "b-" if iteration > 0 else "r--" plt.plot(X_new, y_predict, style) gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) theta = theta - eta * gradients if theta_path is not None: theta_path.append(theta) plt.xlabel("$x_1$", fontsize=18) plt.axis([0, 2, 0, 15]) plt.title(r"$\eta = {}$".format(eta), fontsize=16) np.random.seed(42) theta_path_bgd = [] theta = np.random.randn(2,1) # random initialization X_new = np.array([[0], [2]]) X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance plt.figure(figsize=(10,4)) plt.subplot(131); plot_gradient_descent(theta, eta=0.02) plt.ylabel("$y$", rotation=0, fontsize=18) plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd) plt.subplot(133); plot_gradient_descent(theta, eta=0.5) plt.show() print('OK')  %%%% Output: stream stochastic gradient descent theta=[4.15435925 2.94177079] %%%% Output: display_data ![]() %%%% Output: stream OK %% Cell type:markdown id: tags: ### Qb The Stochastic Gradient Descent Method (SGD) Now, introducing the _stochastic_ variant of gradient descent, explain the stochastic nature of the SGD, and comment on the difference to the _normal_ gradient descent method (GD) we just saw. Also explain the role of the calls to np.random.randint() in the code, HINT: In detail, the important differences are, that the main loop for SGC is python for epoch in range(n_epochs): for i in range(m): . . . gradients = 2 * xi.T.dot(xi.dot(theta) - yi) eta = ... theta = ...  where it for the GD method was just python for iteration in range(n_iterations): gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) theta = ..  NOTE: the call np.random.seed(42) resets the random generator so that it produces the same random-sequence when re-running the code. %% Cell type:code id: tags:  python # TODO: Qb...run this code # NOTE: code from [GITHOML], 04_training_linear_models.ipynb theta_path_sgd = [] m = len(X_b) np.random.seed(42) n_epochs = 50 t0, t1 = 5, 50 # learning schedule hyperparameters def learning_schedule(t): return t0 / (t + t1) theta = np.random.randn(2,1) # random initialization for epoch in range(n_epochs): for i in range(m): if epoch == 0 and i < 20: y_predict = X_new_b.dot(theta) style = "b-" if i > 0 else "r--" plt.plot(X_new, y_predict, style) random_index = np.random.randint(m) xi = X_b[random_index:random_index+1] yi = y[random_index:random_index+1] gradients = 2 * xi.T.dot(xi.dot(theta) - yi) eta = learning_schedule(epoch * m + i) theta = theta - eta * gradients theta_path_sgd.append(theta) plt.plot(X, y, "b.") from sklearn.linear_model import SGDRegressor sgd_reg = SGDRegressor(max_iter=50, tol=-np.infty, penalty=None, eta0=0.1, random_state=42) sgd_reg.fit(X, y.ravel()) print(f'stochastic gradient descent theta={theta.ravel()}') print(f'Scikit-learn SGDRegressor "thetas": sgd_reg.intercept_={sgd_reg.intercept_}, sgd_reg.coef_={sgd_reg.coef_}') ########################################################## # rest of the code is just for plotting, needs no review plt.xlabel("$x_1$", fontsize=18) plt.ylabel("$y$", rotation=0, fontsize=18) plt.axis([0, 2, 0, 15]) plt.show() print('OK')  %%%% Output: stream stochastic gradient descent theta=[3.84208754 3.03831083] Scikit-learn SGDRegressor "thetas": sgd_reg.intercept_=[3.83534891], sgd_reg.coef_=[2.96948592] %%%% Output: display_data ![]() %%%% Output: stream OK %% Cell type:markdown id: tags: ### Qc Adaptive learning rate for $\eta$ There is also an adaptive learning rate method in the demo code for the SGD. Explain the effects of the learning_schedule() functions. You can set the learning rate parameter (also known as a hyperparameter) in may ML algorithms, say for SGD regression, to a method of your choice python SGDRegressor(max_iter=1, eta0=0.0005, learning_rate="constant", # or 'adaptive' etc. random_state=42)  but as usual, there is a bewildering array of possibilities...we will tackle this problem later when searching for the optimal hyperparameters. NOTE: the learning_schedule() method could also have been used in the normal SG algorithm; is not directly part of the stochastic method, but a concept in itself. %% Cell type:code id: tags:  python # TODO: Qc...in text  %% Cell type:markdown id: tags: ### Qd Mini-batch Gradient Descent Method Finally explain what a __mini-batch__ SG method is, and how it differs from the two others. Again, take a peek into the demo code below, to extract the algorithm details...and explain the __main differences__, compared with the GD and SGD. %% Cell type:code id: tags:  python # TODO: Qd...run this code # NOTE: code from [GITHOML], 04_training_linear_models.ipynb theta_path_mgd = [] n_iterations = 50 minibatch_size = 20 np.random.seed(42) theta = np.random.randn(2,1) # random initialization t0, t1 = 200, 1000 def learning_schedule(t): return t0 / (t + t1) t = 0 for epoch in range(n_iterations): shuffled_indices = np.random.permutation(m) X_b_shuffled = X_b[shuffled_indices] y_shuffled = y[shuffled_indices] for i in range(0, m, minibatch_size): t += 1 xi = X_b_shuffled[i:i+minibatch_size] yi = y_shuffled[i:i+minibatch_size] gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi) eta = learning_schedule(t) theta = theta - eta * gradients theta_path_mgd.append(theta) print(f'mini-batch theta={theta.ravel()}') print('OK')  %%%% Output: stream mini-batch theta=[3.86814831 3.02878845] OK %% Cell type:markdown id: tags: ### Qe Choosing a Gradient Descent Method Explain the $θ_0−θ_1$ plot below, and make a comment on when to use GD/SGD/mini-batch gradient descent (pros and cons for the different methods). %% Cell type:code id: tags:  python # TODO: Qd...run this code # NOTE: code from [GITHOML], 04_training_linear_models.ipynb theta_path_bgd = np.array(theta_path_bgd) theta_path_sgd = np.array(theta_path_sgd) theta_path_mgd = np.array(theta_path_mgd) plt.figure(figsize=(7,4)) plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic") plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch") plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch") plt.legend(loc="upper left", fontsize=16) plt.xlabel(r"$\theta_0$", fontsize=20) plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0) plt.axis([2.5, 4.5, 2.3, 3.9]) plt.show() print('OK')  %%%% Output: display_data ![]() %%%% Output: stream OK %% Cell type:markdown id: tags: ### [OPTIONAL] Qf Extend the MyRegressor Can you extend the MyRegressor class from the previous notebook, adding a numerical train method? Choose one of the gradient descent methods above...perhaps starting with a plain SG method. NOTE: this excercise only possible if linear_regression_2.ipynb has been solve. Can you extend the MyRegressor class from the previous linear_regression_2.ipynb notebook, adding a numerical train method? Choose one of the gradient descent methods above...perhaps starting with a plain SG method. You could add a parameter for the class, indicating it what mode it should be operating: analytical closed-form or numerical, like python class MyRegressor(BaseEstimator): def __init__(self, numerical = False): self.__w = None self.__numerical_mode = numerical . . .  %% Cell type:code id: tags:  python # TODO: Qf...[OPTIONAL]  %% Cell type:markdown id: tags: REVISIONS| | ---------| | 2018-0214| CEF, initial. 2018-0214| CEF, added optional exe. 2018-0220| CEF, major update. 2018-0220| CEF, fixed revision table malformatting. 2018-0225| CEF, removed =0 in expression. 2018-0225| CEF, minor text updates and made Qc optional. 2018-0225| CEF, minor source code cleanups. 2021-0918| CEF, update to ITMAL E21. ... ...
 %% Cell type:markdown id: tags: # ITMAL Exercise ## Linear Regression II The goal of the linear regression was to find the argument $\mathbf w$ that minimizes the objective function, $J$ $$\newcommand\rem[1]{} \rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs} \newcommand\eq[2]{#1 &=& #2\\} \newcommand\ar[2]{\begin{array}{#1}#2\end{array}} \newcommand\ac[2]{\left[\ar{#1}{#2}\right]} \newcommand\st[1]{_{\mbox{\scriptsize #1}}} \newcommand\norm[1]{{\cal L}_{#1}} \newcommand\obs[2]{#1_{\mbox{\scriptsize obs}}^{\left(#2\right)}} \newcommand\diff[1]{\mbox{d}#1} \newcommand\pown[1]{^{(#1)}} \def\pownn{\pown{n}} \def\powni{\pown{i}} \def\powtest{\pown{\mbox{\scriptsize test}}} \def\powtrain{\pown{\mbox{\scriptsize train}}} \def\bX{\mathbf{M}} \def\bX{\mathbf{X}} \def\bZ{\mathbf{Z}} \def\bw{\mathbf{m}} \def\bx{\mathbf{x}} \def\by{\mathbf{y}} \def\bz{\mathbf{z}} \def\bw{\mathbf{w}} \def\btheta{{\boldsymbol\theta}} \def\bSigma{{\boldsymbol\Sigma}} \def\half{\frac{1}{2}} \newcommand\pfrac[2]{\frac{\partial~#1}{\partial~#2}} \newcommand\dfrac[2]{\frac{\mbox{d}~#1}{\mbox{d}#2}} \ar{rl}{ \bw^* &= \mbox{argmin}_\bw~J\\ &= \mbox{argmin}_\bw\frac{1}{2} ||\bX \bw - \by||_2^2 }$$ for some input data, and using the $\norm{2}^2$ or MSE internally in the loss function. To solve this equation in closed form (directly, without any numerical approximation), we found the optimal solution to be of the rather elegant least-square solution $$\bw^* ~=~ \left( \bX^\top \bX \right)^{-1} \bX^\top \by$$ Now, we want to build a python linear regression class that encapsulates this elegant closed-form solution. #### Qa Complete the estimator class MyRegressor This one will be based on a linear regression closed-form solution from the previous notebook (linear_regression_1.ipynb). Use your knowledge of how to create Python classes and the fit-predict Scikit-learn interface. The class hierarchy will look like Finalize or complete the fit(), predict() and mse() score functions in the MyRegressor class; there is already a good structure for it in the code below. Also, notice the implementation of the score() function in the class, that is similar to sklearn.linear_model.LinearRegression's score function, i.e. a $R^2$ score we saw in an earlier lesson. Use the test stub below, that creates some simple test data, similar to [HOML] p.108/110, and runs a fit-predict on your brand new linear regression closed-form estimator. %% Cell type:code id: tags: ` python # TODO: Qa, with some help import numpy as np from numpy.linalg import inv from sklearn.base import BaseEstimator from sklearn.metrics import r2_score from libitmal import utils as itmalutils class MyRegressor(BaseEstimator): def __init__(self): self.__w = None def weights(self): return self.__w def __addbiasterm(self, X: np.array): assert X.ndim==2 assert X.shape[0]>0, 'empty X array'