## MACHINE LEARNING THEORY

## Linear methods for classification

As we try to classificate our data into distinct groups, our predictor **G(x)** takes values in a discrete set **ζ **and we can divide the input space into a collection of regions labeled according to the classification. With linear methods, we mean that the decision boundaries between our predicted classes are linear.

All the response categories have an indicator variable. Thus if **ζ **has** K **classes, there will be **K **such indicators **Yk: k = 1,…,k **with **Yk=1 **if **G=K**, else **0**, these are called dummy variables. All these variables are collected together in a vector **Y = (Y1, Y2, …, Yk)**, and form an **NxK** matrix where **N** is the number of training instances.

Then we fit a linear regression to all Y columns simultaneously:

A new observation is classified as follows:

- Compute the output vector using the fitted linear regression
- Identify the largest component and classify accordingly.

Using this we construct a target ** tk **for each class, where

**is the**

*tk***column of the**

*kth***identity matrix. So we have to reproduce the target of every observation. We can solve is using least-squares as:**

*KxK*As you may notice, the minimization problem is precisely the same that we use for multiple response linear regression.

Here we find a problem, when K>2, the rigidity of the model implies that classes are easily masked by others, and it gets bigger as K grows. To solve this problem we can try to implement polynomial transformations to our input variables, but it will do the trick sometimes, not always. And the model will gain a lot of complexity.

Suppose **f k(x)** is the class-conditional density

**P**of

**X**

*in class*

*G=k**, and let*

**be the prior**

*πk***probability of class**

*K***, being the sum of all**

*k***. A simple application of Bayes theorem gives:**

*πk =*1So, having **fk(x)** is almost equivalent to have **Pr(G=k|X=x)**. A lot of techniques are based on class densities:

- The linear and quadratic discriminant analysis uses Gaussian densities.
- Gaussian mixture models allow non-linear decision boundaries.
- Naive Bayes assumes that each of the class densities is a product of marginal densities, the inputs are conditionally independent in each class.

If we model each class density as a multivariate gaussian:

Linear discriminant analysis(LDA) assumes that the classes have a common covariance matrix ** ∑k = ∑∀k**. In comparing two classes

**and**

*k***, it is sufficient to look at the log-ratio, and:**

*l*The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic part in the exponents. This linear log-odds function implies that the decision boundary between both classes is linear in** x** in

**dimensions. This remains true between all classes, so all decision boundaries are linear.**

*p*The linear discriminant functions

are equivalent to decision rules with ** G(x)= argmax_k δ_k(x)**, in practice we will need to estimate the Gaussian distribution using our training data:

In the case of 2 classes, since the derivation of LDA direction via least-squares does not use a Gaussian assumption for the features, it’s extensible beyond gaussian data, but the intercept still needs the assumption. With more than two classes, LDA is not the same as a linear regression of the class indicator matrix and avoids the masking problem.

If the ** ∑k **is not assumed to be equal, then the cancellation that we used to simplify the

**do not occur and we get the Quadratic Discriminant Analysis(QDA):**

*fk*Then, the decision boundaries between classes become quadratic *{x:δ_k(x)=δ_l(x)}.*

## Regularized discriminant analysis

This technique proposes a combination of LDA and QDA, allowing to shink the separate coefficients of QDA toward a common covariance as in LDA. The regularized covariance matrices look like:

where the estimated ** ∑ **is the pooled covariance matrix as in LDA and

**allows all models in between LDA and QDA.**

*α∈[0,1]*Another option is to allow the ** ∑ **estimated to shrunk towards the scalar covariance:

where ** γ∈[0,1]. **Replacing the estimated

**by**

*∑***leads to a more general covariances**

*∑^(γ)***indexed by pairs of covariances. This leads to a better generalization and allows for easier interpretation of coefficients.**

*∑^(α,γ)*## Reduced Rank LDA

The restriction will allow us to view an informative low dimensional projection of the data. The ** k** centroids un

**-dimensional input space lie in an affiliate subspace of dimensions ≤**

*P***, and if p is much larger than**

*K-1***, this will be a considerable drop in dimensions. Moreover, in locating the closest centroid we can ignore distances orthogonal to this subspace since they contribute equally to each class. With this, we reduce the dimension projection**

*k***into the subspace**

*X****and make the distance comparisons here.**

*H_(k-1)*- If
, we can view the data in a*k = 3*representation adding the third dimension in color or size.*2-d* - In the case of
, we have to ask for an*k>3*dimensional subspace*L<k-1**H_L*optimal for LDA. Optimal is defined as the*⊆ K_(k-1)*that spreads the centroids as much as possible.*H_L*

The dimensions are ordered, so we can compute additional dimensions in sequence. Finding the sequence of optimal subspaces for LDA involves:

- Compute the
matrix of class centroids*K x p*and the common covariance matrix*M*.*W* - Compute
using the eigendecomposition of*M* = MW^(1/2)**W.* - Compute
, the covariance matrix of*B**and its eigendecomposition*M**The columns*B* = V*D_BV*^T.*of*ve**in sequence from first to last define the coordinates of the optimal subspace.*V**

The lth discriminant variable is given by:

The between-class variance is the variance of the class means of ** Z**, and the within-class variance is the pooled variance about the means.

Although we separate the most variance variables, our data will still overlap. By taking the covariance into account, a direction with minimum overlap is found.

The between-class variance of ** Z** is

**, and the within-class variance**

*a^TBa***where**

*a^TWa***is defined earlier and**

*W***is the covariance matrix of the centroid matrix**

*B***.**

*M***, where**

*B+W=T***is the total covariance matrix of**

*T***, ignoring class information.**

*X*Then the problem amounts to maximizing the Rayleigh quotient,

This is a generalized eigenvalue problem, with a given by the largest eigenvalue of ** W^(-1)B**. Then the first direction

**is identical to**

*a1***and**

*v1***is the orthogonal direction to**

*a2***such that**

*a1***is maximized.**

*a2^(T)Ba2/a2^(T)Wa2*Instead of being created for data variable reduction, this technique works well in classification problems. It’s performed by limiting the distance to centroids. The centroids of a gaussian classification can be shown to lie in an L-dimensional subspace ** R^p. **Fitting this by maximum likelihood and then constructing the posterior probabilities using Bayes theorem rounds it.

Gaussian classification dictates the *log*** π_k **correction in distance calculation, that is because the misclassification rate is based on the area of centroids that overlap and if

**are equal, both classes will receive the same amount of errors. If not, we can end up with misclassification errors caused by getting class centroids near the cut points.**

*π_k*The objective of logistic regression is to model the posterior probabilities of K classes via linear functions in x, while at the same time ensuring that they sum to one and remain in [0,1].

Here K-1 is the number of logit transformations(log-odds). The last class is used as the denominator in the log-odds ratio. We can obtain ** Pr(G=k|X=x) **using exponents

To simplify the notation we will use Θ as the entire parameter set Θ = ** {β01,β1^T,…,β0(k-1),β(k-1)T}** and thus denote the probabilities

**Θ**

*Pr(G=k|X=x)=Pk(x:*

*).*## Fitting logistic regression models

Logistic models are usually fitted by maximum likelihood, using the conditional likelihood of G given X. Since P(G|X) specifies the conditional distribution, the multinomial distribution is appropriate. The log-likelihood for N observations is:

where ** Pk(xi:**Θ

**Θ**

*)=Pr(G=k|X=xi,*

*).*In the two-class case, the algorithm is simple, it is convenient to code the two classes ** gi = {1,2}**via a

**response**

*0,1***where:**

*yi*if*yi =1**gi =1*if*yi = 0**gi = 2*

Let ** p1(x:**Θ

**Θ**

*)=p(x:***and**

*)***Θ**

*p2(x:***Θ**

*)=1-p(x:***The log-likelihood can be written as**

*).*Here ** β = {β_01,β_1}, **and we assume that the vector of inputs

**includes the constant term**

*xi***1**to accommodate the intercept. To maximize the log-likelihood we set the derivates to

**0**.

These are p+1 equations non-linear in ** β. **To solve their equations, we can use the Newton-Raphson algorithm, which needs the second derivative or the hessian matrix.

starting with ** βold**, a single Newton update is:

where the derivatives are evaluated at ** βold. **In matrix notation, we can write it as:

Then, the Newton step looks like this:

z is known as the adjusted response. This model is an iterative reweighted least-squares(IRLS) since at each iteration it solves the weighted least squares problem:

Normally ** β=0 **is a good starting point for the iterative process.

## Quadratic approximation and inference

The maximum-likelihood parameter estimates ** β^ **satisfying a self-consistency relationship, they are coefficients of a weighted least-squares fit, where the response is:

the weights are ** wi = pi^(1-pi^)**, both depending on

**itself. Tho connection with least-squares offers a lot to us:**

*β^*- The weighted residual sum of squares is the familiar Pearson Chi-Squared statistic.
- Asymptotic likelihood theory says that if the model is correct,
is consistent, so it converges to the true*β**β*. - The Central Limit Theorem shows that the distribution of
converges to*β^**N(β,(X^TWX)^(-1)).*

As model building for logistic regression models can be costly due to the required iteration, there exist some popular shortcuts to avoid the iteration, some of them are Rao score test to Wald test, they are based on the maximum-likelihood fit of the current model.

## L1 regularized logistic regression

The L1 penalty can be used in any linear regression model. For logistic regression, we can add it as:

We can solve it with IRLS using the quadratic approximation. The score equation for non-zero coefficient variables have the form:

which generalizes the LAR. Path algorithms as LAR for lasso are more difficult because the coefficients are pice-wise smooth rather than linear. Nevertheless, progress can be estimated with quadratic approximations.

Both models seem to be the same, they have exactly the same form, but the logistic regression is more general in that It makes fewer assumptions.

The logistic regression takes the density function ** Pr(x) **and fits the parameters of

**by maximizing the conditional likelihood. Meanwhile, LDA first the parameters by maximizing the full log-likelihood, based on the prior density.**

*P(G|X)*By relying on more assumptions, logistic regression has more information about the parameters and can obtain lower variances on estimation.

Observations far from the decision boundaries are important for the covariance matrix estimation, which makes LDA not robust to gross outliers.

If the data in a plane can be separated by a line, the logistic regression will not converge to the best solution, otherwise, LDA will do it.

Logistic regression and linear discriminant analysis are both good approaches to perform a simple classification. Logistic regression is a more robust model because of the lower amount of assumptions, but in practice, both models perform similarly.