The goal of linear regression is to find the weights $w$ that minimize the sum-of-squares error over all inputs.

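Given the usual ML input data matrix $\mathbf X$ of size $(n,d)$, where each row is a transposed input column vector $(\mathbf{x}^{(i)})^\top$, that is, one data sample of size $d$,

$$
\newcommand\rem[1]{}
\rem{ITMAL: CEF def and LaTeX commands, remember: no newlines in defs}
\mathbf X =
\left[\begin{array}{c}
(\mathbf{x}^{(1)})^\top \\
(\mathbf{x}^{(2)})^\top \\
\vdots \\
(\mathbf{x}^{(n)})^\top
\end{array}\right]
$$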

and $\by$ is the target output column vector of size $n$

$$
\by =
\ac{c}{
y\pown{1} \\
y\pown{2} \\
\vdots \\
y\pown{n} \\
}
$$

The linear regression model, via its hypothesis function and for a column vector input $\bx\powni$ of size $d$ and a column weight vector $\bw$ of size $d+1$ (with the additional element $w_0$ being the bias), can now be written simply as

$$
\ar{rl}{
h(\bx\powni;\bw) &= \bw^\top \ac{c}{1\\\bx\powni}\\
&= w_0 + w_1 x_1\powni + w_2 x_2\powni + \cdots + w_d x_d\powni
}
$$

using the model parameters or weights, $\bw$, aka $\btheta$. To ease notation, $\bx$ is assumed to have the 1 element prepended in the following, so that $\bx$ is a $d+1$ column vector

$$
\ar{rl}{
\ac{c}{1\\\bx\powni} &\mapsto \bx\powni, ~~~~\mbox{by convention in the following...}\\
h(\bx\powni;\bw) &= \bw^\top \bx\powni
}
$$
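The convention above can be sketched in NumPy: a column of ones is prepended to the data matrix so that the bias $w_0$ is handled by the same dot product. The helper names `add_bias_column` and `h`, as well as the small numbers, are assumptions made purely for illustration.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones to X, turning shape (n, d) into (n, d+1)."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

def h(X, w):
    """Linear hypothesis for all rows at once; X must already hold the ones column."""
    return X @ w

# Illustrative data: n=3 samples, d=2 features (made-up numbers)
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
w_demo = np.array([0.5, 1.0, -1.0])  # [w_0 (bias), w_1, w_2]

Xb_demo = add_bias_column(X_demo)    # shape (3, 3)
y_pred_demo = h(Xb_demo, w_demo)     # one prediction per row
print(Xb_demo.shape)
print(y_pred_demo)
```

Computing `X @ w` for the whole matrix is equivalent to evaluating $\bw^\top \bx\powni$ row by row.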

This is actually the first fully white-box machine learning algorithm that we see. All the gory details of the algorithm are clearly visible in the internal vector multiplication... quite simple, right? Now we just need to train the weights...

### Loss or Objective Function - Formulation for Linear Regression

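The individual cost (or loss), $L\powni$, for a single input-vector $\bx\powni$ is a measure of how well the model is able to fit the data: the higher the $L\powni$ value, the worse the fit. A loss of $L=0$ means a perfect fit.

It can be given by, say, the squared difference between the calculated output, $h$, and the desired output, $y$

$$
\ar{rl}{
L\powni &= \left( h(\bx\powni;\bw) - y\powni \right)^2\\
&= \left( \bw^\top\bx\powni - y\powni \right)^2
}
$$

Averaging the individual losses over all $n$ samples gives the mean squared error

$$
\ar{rl}{
\mbox{MSE}(\bX,\by;\bw) &= \frac{1}{n} \sum_{i=1}^{n} L\powni\\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \bw^\top\bx\powni - y\powni \right)^2\\
&= \frac{1}{n} ||\bX \bw - \by||_2^2
}
$$

here using the squared Euclidean norm, $\norm{2}^2$, via the $||\cdot||_2^2$ expressions.

Now the factor $\frac{1}{n}$ is just a constant and can be ignored, yielding the total cost function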

$$
\ar{rl}{
J &= \frac{1}{2} ||\bX \bw - \by||_2^2\\
&\propto \mbox{MSE}
}
$$

adding yet another constant, 1/2, to ease later differentiation of $J$.
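As a small sketch of the cost function, the snippet below evaluates $J$ and the MSE for two weight vectors and checks that they differ only by the constant factor $n/2$. The function names `cost_J` and `mse` and the tiny data set are assumptions for illustration; the data matrix already contains the prepended ones column.

```python
import numpy as np

def cost_J(X, y, w):
    """Total cost J = 1/2 * ||X w - y||_2^2."""
    r = X @ w - y              # residual vector
    return 0.5 * (r @ r)

def mse(X, y, w):
    """Mean squared error, 1/n * ||X w - y||_2^2."""
    r = X @ w - y
    return np.mean(r ** 2)

# Made-up data generated exactly by y = 1 + 2x (first column is the bias column)
X_demo = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

w_perfect = np.array([1.0, 2.0])   # reproduces y_demo exactly
w_off = np.array([0.0, 2.0])       # bias is wrong by 1

n = X_demo.shape[0]
print(cost_J(X_demo, y_demo, w_perfect))                      # zero cost for a perfect fit
print(cost_J(X_demo, y_demo, w_off), mse(X_demo, y_demo, w_off))
print(np.isclose(cost_J(X_demo, y_demo, w_off),
                 n / 2 * mse(X_demo, y_demo, w_off)))         # J = n/2 * MSE
```

Since $J$ and MSE differ only by a positive constant, both are minimized by the same $\bw$.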

### Training

Training the linear regression model now amounts to computing the optimal value of the weight vector $\bw$, that is, finding the $\bw$ that minimizes the total cost

$$
\bw^* = \mbox{argmin}_\bw~J
$$

where $\mbox{argmin}_\bw$ means: find the argument $\bw$ that minimizes the function $J$. This minimizer (sometimes a maximizer, via argmax) is denoted $\bw^*$ in most ML literature.

In 2-D, the minimization can be visualized as finding the lowest point of $J$, which for linear regression always forms a convex shape

<img src="https://itundervisning.ase.au.dk/GITMAL/L05/Figs/minimization.png" alt="WARNING: image could not be fetched" style="height:240px">
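To make the argmin idea concrete, here is a rough brute-force sketch for a model with a single weight and no bias: $J$ is evaluated on a grid of candidate weights and the lowest point is picked with `np.argmin`. The data and the grid are made up for illustration, and a grid search is only meant to visualize the idea, not to serve as a training method.

```python
import numpy as np

# Made-up 1-D data without bias term, generated exactly by y = 3x
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = 3.0 * x_demo

# Evaluate J(w) = 1/2 * sum((w*x - y)^2) on a grid of candidate weights
w_grid = np.linspace(0.0, 6.0, 61)
J_grid = np.array([0.5 * np.sum((w * x_demo - y_demo) ** 2) for w in w_grid])

# argmin returns the index of the lowest J; the weight at that index is w*
w_star_grid = w_grid[np.argmin(J_grid)]
print(w_star_grid)   # close to the generating slope of 3

# Convexity check on the grid: second differences of J are all positive
second_diff = np.diff(J_grid, n=2)
print(np.all(second_diff > 0))
```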

### Training: The Closed-form Solution

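To solve for $\bw^*$ in closed form (i.e. directly, without any numerical approximation), we find the gradient of $J$ with respect to $\bw$, taking the partial derivatives $\partial/\partial\bw$ of $J$ via the gradient (nabla) operator

$$
\nabla_\bw~J = \left[ \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_d} \right]^\top
$$

Setting the gradient to zero yields the optimal solution for $\bw$

$$
\ar{rl}{
\nabla_\bw~J &= \bX^\top \left( \bX \bw - \by \right) = 0\\
\bX^\top\bX \bw &= \bX^\top\by
}
$$

giving the closed-form solution, provided that $\bX^\top\bX$ is invertible,

$$
\bw^* = \left( \bX^\top \bX \right)^{-1} \bX^\top \by
$$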

You already know this method from math: finding the extrema of a function, say

$$f(w)=w^2-2w-2$$

which are found where the derivative vanishes, $\mbox{d}~f(w)/\mbox{d}w = 0$

$$
\dfrac{\mbox{d}~f(w)}{\mbox{d}w} = 2w - 2 = 0
$$

so we see that there is an extremum at $w=1$. Checking the second derivative tells us whether we are seeing a minimum, a maximum or a saddle point at that point. In matrix terms, this corresponds to finding the _Hessian_ matrix, which gets notationally tricky due to the multiple feature dimensions involved.
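The one-dimensional example can be checked numerically with a short sketch. The function names `f`, `df` and `d2f` are chosen here for illustration. The last part assumes the standard result that the Hessian of $J = \frac{1}{2}||\bX\bw - \by||_2^2$ is $\bX^\top\bX$, which is not derived in the text, and uses a made-up data matrix.

```python
import numpy as np

def f(w):
    """Example function from the text."""
    return w**2 - 2*w - 2

def df(w):
    """First derivative, 2w - 2."""
    return 2*w - 2

def d2f(w):
    """Second derivative, constant 2."""
    return 2.0

w_ext = 1.0  # root of df(w) = 0
print(df(w_ext), d2f(w_ext), f(w_ext))  # zero slope, positive curvature: a minimum

# Central finite difference as an independent check of the slope at w_ext
eps = 1e-6
df_num = (f(w_ext + eps) - f(w_ext - eps)) / (2 * eps)
print(abs(df_num) < 1e-6)

# Multi-dimensional analogue (assumption: Hessian of J is X^T X, a standard result)
X_demo = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
hessian_demo = X_demo.T @ X_demo
eigvals_demo = np.linalg.eigvalsh(hessian_demo)  # eigenvalues of a symmetric matrix
print(eigvals_demo)  # all positive here, so J has a unique minimum for this X
```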

#### Qa Write a Python function that uses the closed-form to find $\bw^*$

Use the test data, `X1` and `y1`, in the code below to find `w1` via the closed-form solution. Use the test vectors for `w1` to test your implementation, and remember to add the bias term (concatenate an all-one vector to `X` before solving).

#### Qb Find the limits of the least-square method

Again find the least-square optimal value for `w2`, now using `X2` and `y2` as inputs.

Describe the problem with the matrix inverse. For which `M` and `N` combinations do you see that calculating the matrix inverse takes a long time?

%% Cell type:code id: tags:

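``` python
# TODO: Qb...
import numpy as np

# TEST DATA: Matrix, taken from [HOML], p108
M = 100  # number of samples
N = 1    # number of features

print(f'More test data, M={M}, N={N}...')

X2 = 2 * np.random.rand(M, N)
y2 = 4 + 3 * X2 + np.random.randn(M, 1)
y2 = y2[:, 0]  # flatten from (M,1) to (M,); well, could do better here!

# Exercise placeholder: remove this assert once w2 is computed
assert False, "find the least-square solution for X2 and y2, again"

# w2 =
```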

%% Cell type:markdown id: tags:

REVISIONS| |
---------|-|
2018-1218| CEF, initial.
2018-0214| CEF, major update.
2018-0218| CEF, fixed error in nabla expression.
2018-0218| CEF, added minimization plot.
2018-0218| CEF, added note on argmin/max.
2018-0218| CEF, changed concave to convex.
2021-0926| CEF, update for ITMAL E21.
2011-1002| CEF, corrected page numbers for HOML v2 (109=>114).