# Lasso

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally introduced in geophysics,[1] and later by Robert Tibshirani,[2] who coined the term.

## Lasso

Lasso was originally formulated for linear regression models. This simple case reveals a substantial amount about the estimator. These include its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates do not need to be unique if covariates are collinear.

Though originally defined for linear regression, lasso regularization is easily extended to other statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[2][3] Lasso's ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics and convex analysis.

Prior to lasso, the most widely used method for choosing covariates was stepwise selection. That approach only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome. However, in other cases, it can increase prediction error.

Therefore, the lasso estimates share features of both ridge and best subset selection regression since they both shrink the magnitude of all the coefficients, like ridge regression and set some of them to zero, as in the best subset selection case. Additionally, while ridge regression scales all of the coefficients by a constant factor, lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it.

Lasso can set coefficients to zero, while the superficially similar ridge regression cannot. This is due to the difference in the shape of their constraint boundaries. Both lasso and ridge regression can be interpreted as minimizing the same objective function

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of λ \displaystyle \lambda .[7] It is assumed that X \displaystyle X is standardized with z-scores and that y \displaystyle y is centered (zero mean). Let β 0 \displaystyle \beta _0 represent the hypothesized regression coefficients and let b O L S \displaystyle b_OLS refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values.[8] This results in

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.[2]

Group lasso allows groups of related covariates to be selected as a single unit, which can be useful in settings where it does not make sense to include some covariates without others.[10] Further extensions of group lasso perform variable selection within individual groups (sparse group lasso) and allow overlap between groups (overlap group lasso).[11][12]

Fused lasso can account for the spatial or temporal characteristics of a problem, resulting in estimates that better match system structure.[13] Lasso-regularized models can be fit using techniques including subgradient methods, least-angle regression (LARS), and proximal gradient methods. Determining the optimal value for the regularization parameter is an important part of ensuring that the model performs well; it is typically chosen using cross-validation.

In 2005, Zou and Hastie introduced the elastic net.[6] When p > n (the number of covariates is greater than the sample size) lasso can select only n covariates (even when more are associated with the outcome) and it tends to select one covariate from any set of highly correlated covariates. Additionally, even when n > p, ridge regression tends to perform better given strongly correlated covariates.

In 2006, Yuan and Lin introduced the group lasso to allow predefined groups of covariates to jointly be selected into or out of a model.[10] This is useful in many settings, perhaps most obviously when a categorical variable is coded as a collection of binary covariates. In this case, group lasso can ensure that all the variables encoding the categorical covariate are included or excluded together. Another setting in which grouping is natural is in biological studies. Since genes and proteins often lie in known pathways, which pathways are related to an outcome may be more significant than whether individual genes are. The objective function for the group lasso is a natural generalization of the standard lasso objective

In some cases, the phenomenon under study may have important spatial or temporal structure that must be considered during analysis, such as time series or image-based data. In 2005, Tibshirani and colleagues introduced the fused lasso to extend the use of lasso to this type of data.[13] The fused lasso objective function is

The first constraint is the lasso constraint, while the second directly penalizes large changes with respect to the temporal or spatial structure, which forces the coefficients to vary smoothly to reflect the system's underlying logic. Clustered lasso[14] is a generalization of fused lasso that identifies and groups relevant covariates based on their effects (coefficients). The basic idea is to penalize the differences between the coefficients so that nonzero ones cluster. This can be modeled using the following regularization:

The prior lasso was introduced for generalized linear models by Jiang et al. in 2016 to incorporate prior information, such as the importance of certain covariates.[23] In prior lasso, such information is summarized into pseudo responses (called prior responses) y ^ p \displaystyle \hat y^\mathrm p and then an additional criterion function is added to the usual objective function with a lasso penalty. Without loss of generality, in linear regression, the new objective function can be written as

Prior lasso is more efficient in parameter estimation and prediction (with a smaller estimation error and prediction error) when the prior information is of high quality, and is robust to the low quality prior information with a good choice of the balancing parameter η \displaystyle \eta .

The loss function of the lasso is not differentiable, but a wide variety of techniques from convex analysis and optimization theory have been developed to compute the solutions path of the lasso. These include coordinate descent,[24] subgradient methods, least-angle regression (LARS), and proximal gradient methods.[25] Subgradient methods are the natural generalization of traditional methods such as gradient descent and stochastic gradient descent to the case in which the objective function is not differentiable at all points. LARS is a method that is closely tied to lasso models, and in many cases allows them to be fit efficiently, though it may not perform well in all circumstances. LARS generates complete solution paths.[25] Proximal methods have become popular because of their flexibility and performance and are an area of active research. The choice of method will depend on the particular lasso variant, the data and the available resources. However, proximal methods generally perform well.

Choosing the regularization parameter ( λ \displaystyle \lambda ) is a fundamental part of lasso. A good value is essential to the performance of lasso since it controls the strength of shrinkage and variable selection, which, in moderation can improve both prediction accuracy and interpretability. However, if the regularization becomes too strong, important variables may be omitted and coefficients may be shrunk excessively, which can harm both predictive capacity and inferencing. Cross-validation is often used to find the regularization parameter.

Information criteria such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) might be preferable to cross-validation, because they are faster to compute and their performance is less volatile in small samples.[26] An information criterion selects the estimator's regularization parameter by maximizing a model's in-sample accuracy while penalizing its effective number of parameters/degrees of freedom. Zou et al. proposed to measure the effective degrees of freedom by counting the number of parameters that deviate from zero.[27] The degrees of freedom approach was considered flawed by Kaufman and Rosset[28] and Janson et al.,[29] because a model's degrees of freedom might increase even when it is penalized harder by the regularization parameter. As an alternative, the relative simplicity measure defined above can be used to count the effective number of parameters.[26] For the lasso, this measure is given by

A lazo or lasso (/ˈlæsoʊ/ or /læˈsuː/), also called in Mexico reata and la reata,[1][2] and in the United States riata, or lariat[3] (from Mexican Spanish, lasso for roping cattle),[4] is a loop of rope designed as a restraint to be thrown around a target and tightened when pulled. It is a well-known tool of the Mexican and South American cowboys, then adopted, from the Mexicans, by the cowboys of the United States. The word is also a verb; to lasso is to throw the loop of rope around something. 041b061a72