# Linear Regression

## 1. Notations

- Let $p\in\mathbb{N}^*$ be the number of features
- Let $n\in\mathbb{N}^*$ be the number of observations
- Let $(X_1,Y_1),\dots,(X_n,Y_n)\in\mathbb{R}^p\times\mathbb{R}$ be the observations
- Let $X=\begin{pmatrix}1&X_1^T\\1&X_2^T\\\vdots&\vdots\\1&X_n^T\end{pmatrix},\quad Y=\begin{pmatrix}Y_1\\Y_2\\ \vdots \\Y_n\end{pmatrix}$

We want to find the vector $\beta\in\mathbb{R}^{p+1}$ that minimizes the distance
$$\lVert Y-X\beta \rVert_2$$

## 2. Value of $\beta$

### 2.1 General Case

Minimizing $\lVert Y-X\beta \rVert_2$ is the same as minimizing $\lVert Y-X\beta \rVert_2^2$, so we will solve the latter problem.
First, we look for all values of $\beta$ at which the derivative vanishes:
$$\begin{align*}
\frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial\beta}&=\frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial (Y-X\beta)}\,\frac{\partial (Y-X\beta)}{\partial\beta}=-2(Y-X\beta)^TX\\
\frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial\beta}=0&\iff -2(Y-X\beta)^TX=0\\
&\iff X^T(Y-X\beta)=0\\
&\iff X^TX\beta=X^TY\\
&\implies \beta=(X^TX)^+X^TY \text{ is a solution, where }{}^+\text{ denotes the pseudoinverse}\\
&\implies \beta=X^+Y \text{ is a solution}
\end{align*}$$

$\lVert Y-X\beta \rVert_2^2$ is a convex quadratic in $\beta$, so this critical point $\beta$ indeed minimizes the distance.
$$\boxed{\beta=(X^TX)^+X^TY}$$
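A minimal numerical sketch of this closed form (assuming NumPy; the simulated data and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Simulated observations: rows X_i in R^p and responses Y_i in R
X_raw = rng.normal(size=(n, p))
true_beta = np.array([2.0, 1.0, -3.0, 0.5])      # [beta_0, beta_1, ..., beta_p]
Y = true_beta[0] + X_raw @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with the intercept column of ones
X = np.column_stack([np.ones(n), X_raw])

# beta = (X^T X)^+ X^T Y, equivalently beta = X^+ Y
beta = np.linalg.pinv(X.T @ X) @ X.T @ Y
beta_pinv = np.linalg.pinv(X) @ Y                # same solution

assert np.allclose(beta, beta_pinv)
print(beta)                                       # close to true_beta
```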
### 2.2 Probabilistic Approach: $p=1$

For this case, we can also approach the problem probabilistically. We will treat $y$ and $x$ as random variables.
Let $\beta_0,\beta_1\in\mathbb{R}$ be such that
$$y=\beta_0+\beta_1 x+\varepsilon,$$
where we assume that $\varepsilon \sim \mathcal{N}(0,\sigma^2)$ is independent of $x$. We have:
$$\begin{cases}
\text{Cov}[x,y]=\beta_1\mathbb{V}[x]\\
\mathbb{E}[y]=\beta_0+\beta_1\mathbb{E}[x]
\end{cases}
\implies
\begin{cases}
\beta_1=\dfrac{\text{Cov}[x,y]}{\mathbb{V}[x]}\\
\beta_0=\mathbb{E}[y]-\beta_1\mathbb{E}[x]
\end{cases}$$
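A minimal sketch of the plug-in (sample-moment) version of these two formulas (assuming NumPy; the simulated data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated data: y = beta_0 + beta_1 * x + noise, with beta_0 = 1.0, beta_1 = 0.7
x = rng.normal(loc=2.0, scale=1.5, size=n)
y = 1.0 + 0.7 * x + rng.normal(scale=0.2, size=n)

# Sample analogues of Cov[x, y] and V[x]
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)

beta_1 = cov_xy / var_x
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)    # close to 1.0 and 0.7
```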
This approach can be extended to $p>1$, which we do in the next subsection.

### 2.3 Probabilistic Approach: General Case

We will treat $y,x_1,\dots,x_p$ as random variables.
- Let $\bold{x}=\begin{pmatrix}x_1 \\ \vdots \\ x_p\end{pmatrix}$
- Let $\beta \in\mathbb{R}^p,\ \beta_0\in\mathbb{R}$ be such that $y=\langle\beta,\bold{x}\rangle+\beta_0 + \varepsilon$
- Furthermore, we will assume that $\varepsilon \sim \mathcal{N}(0,\sigma^2)$ is independent of all $x_i$
- Let $C=\mathbb{E}\left[\left(\bold{x}-\mathbb{E}[\bold{x}]\right)\left(\bold{x}-\mathbb{E}[\bold{x}]\right)^T\right]$ be the covariance matrix of $\bold{x}$
- Let $w=\begin{pmatrix}\text{Cov}[x_1,y]\\ \vdots \\ \text{Cov}[x_p,y]\end{pmatrix}$ be the cross-covariance between $\bold{x}$ and $y$

First of all, we will calculate $\beta$:
$$\begin{align*}
\forall i\in\{1,\dots,p\},\quad\text{Cov}[x_i,y]&=\sum_{j=1}^p\beta_j\text{Cov}[x_i,x_j]\\
\iff C\beta&=w \\
\implies \beta&=C^+w \text{ is a solution}
\end{align*}$$

For $\beta_0$:
$$\beta_0=\mathbb{E}[y]-\langle\beta,\mathbb{E}[\bold{x}]\rangle$$

In conclusion:
$$\boxed{\begin{cases}\beta=C^+w\\ \beta_0=\mathbb{E}[y]-\langle \beta,\mathbb{E}[\bold{x}]\rangle\end{cases}}$$

Knowing $\beta_0,\beta$ this way requires explicit knowledge of $C,w,\mathbb{E}[\bold{x}],\mathbb{E}[y]$, which is almost never available in practice.
So we will estimate $\beta_0,\beta$ by estimating those statistical parameters:
$$\boxed{\begin{cases}\hat{\beta}=\hat{C}^+\hat{w}\\ \hat{\beta}_0=\hat{\mu}(y)-\langle \hat{\beta},\hat{\mu}(\bold{x})\rangle\end{cases}}$$

Suppose we have $n$ independent samples $(\bold{x}_1,y_1),\dots,(\bold{x}_n,y_n)$ of $(\bold{x},y)$, treated as random variables.
If we then use the appropriate estimators (sample means and sample covariances), this formula reduces to the least-squares solution of Section 2.1, i.e. to Linear Regression.
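A minimal sketch checking this equivalence numerically (assuming NumPy; the simulated data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3

# Simulated data: y = 0.3 + <beta_true, x> + noise
x = rng.normal(size=(n, p))
beta_true = np.array([1.5, -0.5, 2.0])
y = 0.3 + x @ beta_true + rng.normal(scale=0.1, size=n)

# Covariance-based plug-in estimator: beta_hat = C_hat^+ w_hat
C_hat = np.cov(x, rowvar=False, ddof=1)            # p x p sample covariance of x
w_hat = np.array([np.cov(x[:, j], y, ddof=1)[0, 1] for j in range(p)])
beta_hat = np.linalg.pinv(C_hat) @ w_hat
beta0_hat = y.mean() - beta_hat @ x.mean(axis=0)

# Least-squares solution of Section 2.1 on the design matrix with intercept
X = np.column_stack([np.ones(n), x])
beta_ls = np.linalg.pinv(X) @ y

assert np.allclose(np.concatenate([[beta0_hat], beta_hat]), beta_ls)
```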
## 3. Significance of the model

Let's call $\mathcal{M}$ our linear model.
**Assumption:** the relation between $y$ and $\bold{x}$ is linear.
We will use the $\mathcal{F}$-test.
### 3.1 Null Hypothesis: $H_0:\beta_i=0\quad\forall i>0$

This null hypothesis implies that $y$ is a constant function of $\bold{x}$ (up to the noise $\varepsilon$).
We will statistically test this hypothesis using ANOVA (analysis of variance).
### 3.2 ANOVA Theorem

If the null hypothesis is true, then:
$$Z=\frac{\dfrac{(y^*- \bar{y})^T(y^*- \bar{y})}{p}}{\dfrac{(y- y^*)^T(y- y^*)}{n-1-p}}\sim\mathcal{F}(p,\,n-1-p)$$

where $y^*$ denotes the vector of fitted values, $y^*_i=\langle\hat{\beta},\bold{x}_i\rangle+\hat{\beta}_0$, and $\bar{y}$ the sample mean of $y$. Let:
$$\begin{cases}
\text{FSS} = \sum_{i=1}^n(y^*_i-\bar{y})^2\\
\text{RSS} = \sum_{i=1}^n(y_i-y_i^*)^2\\
\text{TSS} = \sum_{i=1}^n(y_i-\bar{y})^2 = \text{FSS} +\text{RSS}
\end{cases}$$

We reject the null hypothesis with confidence $1-p_{\text{value}}$, where:
$$\begin{cases}
f=\dfrac{\text{FSS}/p}{\text{RSS}/(n-1-p)}\\
p_{\text{value}}=\mathcal{P}(Z\ge f)
\end{cases}$$
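A minimal sketch of this test (assuming NumPy and SciPy; the simulated data is purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 80, 2

# Simulated data with a genuinely non-constant relation
x = rng.normal(size=(n, p))
y = 0.5 + x @ np.array([1.0, -2.0]) + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.pinv(X) @ y
y_fit = X @ beta_hat

fss = np.sum((y_fit - y.mean()) ** 2)
rss = np.sum((y - y_fit) ** 2)

f = (fss / p) / (rss / (n - 1 - p))
p_value = stats.f.sf(f, p, n - 1 - p)    # P(Z >= f) under the null hypothesis
print(f, p_value)                         # tiny p_value: reject H0
```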
### 3.3 Significance

Assuming a linear dependence between the variables, this result suggests that, with confidence $1-p_{\text{value}}$, $y$ is not a constant function of $\bold{x}$.

## 4. Confidence interval of the prediction

**Assumption:** the relation between $y$ and $\bold{x}$ is linear.
We will use Student's $t$-test.
### 4.1 Confidence Interval of the Parameters

Let $\hat{\beta}$ be an estimator of $\beta$. We have:
$$\boxed{\forall i\in\{0,\dots,p\},\quad T_i=\frac{\hat{\beta}_i-\beta_i}{\hat{\sigma}_*\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}}\sim \mathcal{T}_{n-1-p}}$$

where $\hat{\sigma}^2_*$ is an unbiased estimator of the noise variance $\sigma^2=\mathbb{V}[\varepsilon]$, i.e. the variance of $y$ around $y^*=\langle \beta,\bold{x}\rangle+\beta_0=y-\varepsilon$. It is equal to:
$$\boxed{\hat{\sigma}^2_*=\frac{\text{RSS}}{n-1-p}}$$

For $i\in\{0,\dots,p\}$, let $\gamma\in(0,1)$ and let $t\in\mathbb{R}_+$ be such that $\mathcal{P}(\lvert T_i\rvert \ge t)=\gamma$, i.e. $\mathcal{P}(T_i \ge t)=\frac{\gamma}{2}$. We then say that
$$\beta_i=\hat{\beta}_i\pm t\,\hat{\sigma}_*\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}$$
with confidence $1-\gamma$.
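A minimal sketch of these parameter intervals (assuming NumPy and SciPy; the simulated data and the confidence level are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 60, 2

x = rng.normal(size=(n, p))
y = 1.0 + x @ np.array([0.8, -1.2]) + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - 1 - p)                         # unbiased noise-variance estimate

gamma = 0.05                                           # 95% confidence
t_crit = stats.t.ppf(1 - gamma / 2, df=n - 1 - p)      # P(|T_i| >= t_crit) = gamma
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

for i, (b, s) in enumerate(zip(beta_hat, se)):
    print(f"beta_{i}: [{b - t_crit * s:.3f}, {b + t_crit * s:.3f}]")
```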
### 4.2 Confidence Interval of the Prediction

With the same $t$, the confidence interval of the (noiseless) prediction $y^*$ at a new point $\bold{x}$ is:

$$\boxed{y^*= \hat{y}^*\pm t\,\hat{\sigma}_*\sqrt{\begin{pmatrix}1\\\bold{x}\end{pmatrix}^T\left(X^TX\right)^{-1}\begin{pmatrix}1\\\bold{x}\end{pmatrix}}=\langle\hat{\beta},\bold{x}\rangle+\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\begin{pmatrix}1\\\bold{x}\end{pmatrix}^T\left(X^TX\right)^{-1}\begin{pmatrix}1\\\bold{x}\end{pmatrix}}}$$
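A minimal, self-contained sketch of this interval at a new point (assuming NumPy and SciPy; the simulated data is purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 60, 2

x = rng.normal(size=(n, p))
y = 1.0 + x @ np.array([0.8, -1.2]) + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - 1 - p))

gamma = 0.05
t_crit = stats.t.ppf(1 - gamma / 2, df=n - 1 - p)

x_new = np.array([1.0, 0.5, -0.3])                 # (1, x) for the new observation
y_star_hat = x_new @ beta_hat
half_width = t_crit * sigma_hat * np.sqrt(x_new @ XtX_inv @ x_new)
print(y_star_hat - half_width, y_star_hat + half_width)
```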
### 4.3 Case of simple regression: $p=1$

Let $\text{ss}(x)=\sum_{i=1}^{n}(x_i-\bar{x})^2$. The confidence interval of $\beta_0$ is:

$$\beta_0=\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\text{ss}(x)}}$$

The confidence interval of $\beta_1$ is:
$$\beta_1=\hat{\beta}_1\pm t\,\frac{\hat{\sigma}_*}{\sqrt{\text{ss}(x)}}$$

The confidence interval of a new prediction $y^*$ at $x_0$ is:
$$y^*=\hat{\beta}_1x_0+\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\text{ss}(x)}}$$
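A minimal sketch checking that these $p=1$ formulas agree with the general matrix expressions of Sections 4.1 and 4.2 (assuming NumPy; the simulated data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50

x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - 2))   # n - 1 - p with p = 1

ss_x = np.sum((x - x.mean()) ** 2)

# Standard errors from the specialized p = 1 formulas ...
se_b0 = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / ss_x)
se_b1 = sigma_hat / np.sqrt(ss_x)

# ... and from the general (X^T X)^{-1} expression of Section 4.1
se_general = sigma_hat * np.sqrt(np.diag(XtX_inv))

assert np.allclose([se_b0, se_b1], se_general)
```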