Algorithm for anomaly detection

Isolation Forest is an algorithm for data anomaly detection initially developed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.^([1]) Isolation Forest detects anomalies using binary trees. The algorithm has a linear time complexity and a low memory requirement, which works well with high-volume data.^([2][3])

[Figure: The normalized anomaly scores of the Isolation Forest algorithm fit on the Old Faithful dataset]

Isolation Forest splits the data space using axis-parallel cuts and assigns higher anomaly scores to data points that need fewer splits to be isolated. The figure on the right shows an application of the Isolation Forest algorithm to the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park. Darker shades of red indicate higher estimated anomaly scores.

History

The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.^([2]) In 2010, an extension of the algorithm, SCiForest,^([4]) was developed to address clustered and axis-parallel anomalies. In 2012^([3]) the same authors demonstrated that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data.

In 2013, Zhiguo Ding and Minrui Fei proposed a framework based on iForest to resolve the problem of detecting anomalies in streaming data.^([5]) More applications of iForest to streaming data are described in papers by Tan et al.,^([6]) Susto et al.^([7]) and Weng et al.^([8])

In 2018,^([9]) an extension of iForest aimed at improving the reliability of the anomaly score produced for a given data point was proposed. In 2022,^([10]) the algorithm was applied to microscopy data to push the detection limit of single unlabeled proteins.

Algorithm

[Figure 2: An example of isolating a non-anomalous point in a 2D Gaussian distribution.]

The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.

[Figure 3: An example of isolating an anomalous point in a 2D Gaussian distribution.]

An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and Fig. 3 for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.

Recursive partitioning can be represented by a tree structure named an Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, from the root to a terminating node. For example, the path length of point $x_i$ in Fig. 2 is greater than the path length of $x_j$ in Fig. 3.
Let $X = \{x_1, \dots, x_n\}$ be a set of d-dimensional points and $X' \subset X$. An Isolation Tree (iTree) is defined as a data structure with the following properties:

1. for each node $T$ in the tree, $T$ is either an external node with no child, or an internal node with one "test" and exactly two child nodes ($T_l$ and $T_r$);
2. a test at node $T$ consists of an attribute $q$ and a split value $p$ such that the test $q < p$ determines the traversal of a data point to either $T_l$ or $T_r$.

In order to build an iTree, the algorithm recursively divides $X'$ by randomly selecting an attribute $q$ and a split value $p$, until either

1. the node has only one instance, or
2. all data at the node have the same values.

When the iTree is fully grown, each point in $X$ is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length $h(x_i)$ of point $x_i \in X$ is defined as the number of edges $x_i$ traverses from the root node to reach an external node. A probabilistic explanation of iTree is provided in the original iForest paper.^([2])
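The following is a minimal Python sketch of iTree construction and path-length evaluation as just described. It is illustrative rather than a reference implementation; the Node class and function names are invented for this example, and a depth cap (which practical implementations add) is included as max_depth.

    import random

    class Node:
        """iTree node: internal (attribute, split, two children) or external (no children)."""
        def __init__(self, left=None, right=None, attr=None, split=None, size=0):
            self.left, self.right = left, right
            self.attr, self.split = attr, split   # the "test": attribute q and split value p
            self.size = size                      # points ending up at this external node

    def build_itree(X, depth=0, max_depth=50):
        # Stop when one instance remains or all remaining points are identical.
        if len(X) <= 1 or all(x == X[0] for x in X) or depth >= max_depth:
            return Node(size=len(X))
        d = len(X[0])
        attr = random.randrange(d)                          # randomly selected attribute q
        lo, hi = min(x[attr] for x in X), max(x[attr] for x in X)
        if lo == hi:                                        # cannot split on this attribute
            return Node(size=len(X))
        split = random.uniform(lo, hi)                      # random split value p in [min, max]
        left = [x for x in X if x[attr] < split]            # test q < p sends points left
        right = [x for x in X if x[attr] >= split]
        return Node(build_itree(left, depth + 1, max_depth),
                    build_itree(right, depth + 1, max_depth), attr, split)

    def path_length(tree, x):
        # h(x): number of edges from the root to the external node that holds x.
        h = 0
        while tree.attr is not None:
            tree = tree.left if x[tree.attr] < tree.split else tree.right
            h += 1
        return h

Points isolated after only a few splits receive short path lengths, which the anomaly score defined below converts into a score near 1.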
Properties of isolation forest

- Sub-sampling: As iForest does not need to isolate all normal instances, it can frequently ignore the majority of the training sample. As a consequence, iForest works very well when the sampling size is kept small, a property that is in contrast with the great majority of existing methods, where a large sampling size is usually desirable.^([2][3])
- Swamping: When normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomenon known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. One of the main reasons for swamping is the presence of too much data for the purpose of anomaly detection, which implies one possible solution to the problem is sub-sampling. Since iForest responds very well to sub-sampling in terms of performance, reducing the number of points in the sample is also a good way to reduce the effect of swamping.^([2])
- Masking: When the number of anomalies is high, it is possible that some of them aggregate in a dense and large cluster, making it more difficult to separate the single anomalies and, in turn, to detect such points as anomalous. Similarly to swamping, this phenomenon (known as "masking") is also more likely when the number of points in the sample is big, and can be alleviated through sub-sampling.^([2])
- High-dimensional data: One of the main limitations of standard, distance-based methods is their inefficiency in dealing with high-dimensional datasets.^([11]) The main reason is that in a high-dimensional space every point is equally sparse, so a distance-based measure of separation is largely ineffective. Unfortunately, high dimensionality also affects the detection performance of iForest, but the performance can be vastly improved by adding a feature-selection test, such as one based on kurtosis, to reduce the dimensionality of the sample space.^([2][4])
- Normal instances only: iForest performs well even if the training set does not contain any anomalous point,^([4]) the reason being that iForest describes data distributions in such a way that high values of the path length $h(x_i)$ correspond to the presence of data points. As a consequence, the presence of anomalies is fairly irrelevant to iForest's detection performance.

Anomaly detection with isolation forest

Anomaly detection with Isolation Forest is a process composed of two main stages:^([4])

1. in the first stage, a training dataset is used to build iTrees;
2. in the second stage, each instance in the test set is passed through these iTrees, and a proper "anomaly score" is assigned to the instance.

Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as "anomaly" any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to.

Anomaly score

The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination at an external node of the iTree corresponds to an unsuccessful search in the BST.^([4]) As a consequence, the estimation of the average $h(x)$ for external node terminations is the same as that of the unsuccessful searches in a BST, that is^([12])

$c(m) = \begin{cases} 2H(m-1) - \frac{2(m-1)}{m} & \text{for } m > 2 \\ 1 & \text{for } m = 2 \\ 0 & \text{otherwise} \end{cases}$

where $m$ is the size of the sample set used to build the tree and $H$ is the harmonic number, which can be estimated by $H(i) = \ln(i) + \gamma$, where $\gamma = 0.5772156649$ is the Euler-Mascheroni constant.

The value of $c(m)$ above represents the average of $h(x)$ given $m$, so we can use it to normalize $h(x)$ and get an estimation of the anomaly score for a given instance x:

$s(x, m) = 2^{-\frac{E(h(x))}{c(m)}}$

where $E(h(x))$ is the average value of $h(x)$ from a collection of iTrees. It is interesting to note that for any given instance x:

- if $s$ is close to 1 then $x$ is very likely to be an anomaly;
- if $s$ is smaller than 0.5 then $x$ is likely to be a normal value;
- if for a given sample all instances are assigned an anomaly score of around 0.5, then it is safe to assume that the sample doesn't have any anomaly.
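As a concrete illustration of the two formulas above, here is a small Python sketch (the function names are illustrative) that normalizes an observed average path length into the anomaly score $s(x, m)$:

    import math

    EULER_GAMMA = 0.5772156649   # the Euler-Mascheroni constant

    def c(m):
        # Average path length of an unsuccessful BST search over m points.
        if m > 2:
            H = math.log(m - 1) + EULER_GAMMA    # H(i) estimated as ln(i) + gamma
            return 2 * H - 2 * (m - 1) / m
        return 1.0 if m == 2 else 0.0

    def anomaly_score(mean_path_length, m):
        # s(x, m) = 2^(-E(h(x)) / c(m)): short average paths give scores near 1.
        return 2 ** (-mean_path_length / c(m))

    # Trees grown on sub-samples of m = 256 points:
    print(anomaly_score(3.0, 256))    # ~0.82, likely an anomaly
    print(anomaly_score(12.0, 256))   # ~0.44, likely a normal point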
Open source implementations

Original implementation:

- Isolation Forest, an algorithm that detects data anomalies using binary trees, written in R. Released by the paper's first author, Fei Tony Liu, in 2009.

Other implementations (in alphabetical order):

- Isolation Forest - a Spark/Scala implementation, created by James Verbus from the LinkedIn Anti-Abuse AI team.
- Isolation Forest by H2O-3 - an implementation of Isolation Forest for anomaly detection by H2O-3.
- Package solitude - an implementation in R by Srikanth Komala Sheshachala.
- Python implementation with examples in scikit-learn.
- Spark iForest - a distributed implementation in Scala and Python, which runs on Apache Spark. Written by Fangzhou Yang.
- PyOD IForest - another Python implementation, in the popular Python Outlier Detection (PyOD) library.

Other variations of Isolation Forest algorithm implementations:

- Extended Isolation Forest - an implementation of Extended Isolation Forest for anomaly detection by Sahand Hariri.
- Extended Isolation Forest by H2O-3 - an implementation of Extended Isolation Forest for anomaly detection by H2O-3.
- (Python, R, C/C++) Isolation Forest and variations - an implementation of Isolation Forest and its variations by David Cortes.

See also

- Anomaly detection
- Random forest

References

1. Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining: 413-422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449.
2. ^(a b c d e f g) Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining: 413-422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449.
3. ^(a b c) Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (2012). "Isolation-Based Anomaly Detection". ACM Transactions on Knowledge Discovery from Data. 6 (1): 3:1-3:39. doi:10.1145/2133360.2133363. S2CID 207193045.
4. ^(a b c d e) Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (September 2010). "On Detecting Clustered Anomalies Using SCiForest". Joint European Conference on Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2010. Lecture Notes in Computer Science. 6322: 274-290. doi:10.1007/978-3-642-15883-4_18. ISBN 978-3-642-15882-7.
5. Ding, Zhiguo; Fei, Minrui (September 2013). "An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data Using Sliding Window". 3rd IFAC International Conference on Intelligent Control and Automation Science.
6. Tan, Swee Chuan; Ting, Kai Ming; Liu, Fei Tony (2011). "Fast anomaly detection for streaming data". Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. Vol. 2. AAAI Press. pp. 1511-1516. doi:10.5591/978-1-57735-516-8/IJCAI11-254. ISBN 9781577355144.
7. Susto, Gian Antonio; Beghi, Alessandro; McLoone, Sean (2017). "Anomaly detection through on-line isolation forest: An application to plasma etching". 2017 28th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC). pp. 89-94. doi:10.1109/ASMC.2017.7969205. ISBN 978-1-5090-5448-0.
8. Weng, Yu; Liu, Lei (15 April 2019). "A Collective Anomaly Detection Approach for Multidimensional Streams in Mobile Service Security". IEEE Access. 7: 49157-49168. doi:10.1109/ACCESS.2019.2909750.
9. Hariri, Sahand; Carrasco Kind, Matias; Brunner, Robert J. (2019). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering. 33 (4): 1479-1489. arXiv:1811.02141. doi:10.1109/TKDE.2019.2947676. S2CID 53236735.
10. Dahmardeh, Mahyar; Mirzaalian Dastjerdi, Houman; Mazal, Hisham; Köstler, Harald; Sandoghdar, Vahid (2023-02-27). "Self-supervised machine learning pushes the sensitivity limit in label-free detection of single proteins below 10 kDa". Nature Methods. 20 (3): 442-447. doi:10.1038/s41592-023-01778-2. ISSN 1548-7105. PMID 36849549.
11. Dilini Talagala, Priyanga; Hyndman, Rob J.; Smith-Miles, Kate (12 Aug 2019). "Anomaly Detection in High Dimensional Data". arXiv:1908.04000 [stat.ML].
12. Shaffer, Clifford A. (2011). Data Structures & Algorithm Analysis in Java (3rd Dover ed.). Mineola, NY: Dover Publications. ISBN 9780486485812. OCLC 721884651.
This article is about the machine learning technique. For other kinds of random tree, see Random tree.

Binary search tree based ensemble machine learning method
[Figure: Diagram of a random decision forest]

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.^([1][2]) Random decision forests correct for decision trees' habit of overfitting to their training set.^([3]: 587-588) Random forests generally outperform decision trees, but their accuracy is lower than that of gradient boosted trees. However, data characteristics can affect their performance.^([4][5])

The first algorithm for random decision forests was created in 1995 by Tin Kam Ho^([1]) using the random subspace method,^([2]) which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.^([6][7][8])

An extension of the algorithm was developed by Leo Breiman^([9]) and Adele Cutler,^([10]) who registered^([11]) "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).^([12]) The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho^([1]) and later independently by Amit and Geman,^([13]) in order to construct a collection of decision trees with controlled variance.

Random forests are frequently used as black box models in businesses, as they generate reasonable predictions across a wide range of data while requiring little configuration.

History

The general method of random decision forests was first proposed by Ho in 1995.^([1]) Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines^([2]) concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. Note that this observation of a more complex classifier (a larger forest) getting more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.^([6][7][8])

The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman,^([13]) who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree.
The idea of random subspace selection from Ho^([2]) was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Thomas G. Dietterich.^([14])

The proper introduction of random forests was made in a paper by Leo Breiman.^([9]) This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

1. Using out-of-bag error as an estimate of the generalization error.
2. Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

Algorithm

Preliminaries: decision tree learning

Main article: Decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".^([3]: 352)

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.^([3]: 587-588) This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

A forest can be thought of as pooling the efforts of many decision tree algorithms: the teamwork of many trees improves on the performance of any single random tree. Though not exactly equivalent, a forest also gives an effect similar to that of k-fold cross-validation.

Bagging

Main article: Bootstrap aggregating

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples. For $b = 1, \dots, B$:

1. Sample, with replacement, n training examples from X, Y; call these $X_b$, $Y_b$.
2. Train a classification or regression tree $f_b$ on $X_b$, $Y_b$.
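As an illustration, bagging of tree learners can be sketched in a few lines of Python; this is a schematic built on scikit-learn's DecisionTreeRegressor, not the exact procedure of any particular library:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_bagged_trees(X, y, B=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for b in range(B):
            idx = rng.integers(0, n, size=n)          # step 1: sample n examples with replacement
            tree = DecisionTreeRegressor(random_state=b)
            trees.append(tree.fit(X[idx], y[idx]))    # step 2: train f_b on (X_b, Y_b)
        return trees

    def bagged_predict(trees, X_new):
        # Average the predictions of the individual regression trees.
        return np.mean([t.predict(X_new) for t in trees], axis=0)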
After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$

or by taking the majority vote in the case of classification trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x':

$\sigma = \sqrt{\frac{\sum_{b=1}^{B} (f_b(x') - \hat{f})^2}{B - 1}}.$

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample.^([15]) The training and test error tend to level off after some number of trees have been fit.

From bagging to random forests

Main article: Random subspace method

The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.^([16])

Typically, for a classification problem with p features, $\sqrt{p}$ (rounded down) features are used in each split.^([3]: 592) For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.^([3]: 592) In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem.^([3]: 592)
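In scikit-learn, for example, feature bagging is controlled by the max_features parameter, and the per-tree predictions needed for the standard-deviation estimate above are exposed through the fitted estimators. A sketch, with dataset and parameter choices that are purely illustrative:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=500, n_features=12, noise=1.0, random_state=0)

    # max_features=1/3 mirrors the inventors' p/3 default for regression;
    # 'sqrt' would mirror the floor(sqrt(p)) default for classification.
    forest = RandomForestRegressor(n_estimators=300, max_features=1 / 3, random_state=0)
    forest.fit(X, y)

    x_new = X[:5]
    mean_prediction = forest.predict(x_new)

    # Standard deviation across the B individual trees, as in the formula above.
    per_tree = np.stack([t.predict(x_new) for t in forest.estimators_])
    sigma = per_tree.std(axis=0, ddof=1)   # ddof=1 matches the (B - 1) denominator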
ExtraTrees

Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal cut-point for each feature under consideration (based on, e.g., information gain or the Gini impurity), a random cut-point is selected. This value is selected from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly generated splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are $\sqrt{p}$ for classification and $p$ for regression, where $p$ is the number of features in the model.^([17])

Properties

Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper^([9]) and is implemented in the R package randomForest.^([10])

The first step in measuring the variable importance in a data set $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the $j$-th feature after training, the values of the $j$-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the $j$-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.^([18])

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations^([19][20][4]) and growing unbiased trees^([21][22]) can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.^([23])
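scikit-learn ships a variant of this procedure as permutation_importance; unlike the out-of-bag scheme just described, it scores permutations on a held-out set, but the averaging-and-normalizing logic is the same. A sketch with an arbitrary synthetic dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # Permute each feature in turn and record the drop in held-out accuracy.
    result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
    for j in np.argsort(result.importances_mean)[::-1][:5]:
        print(f"feature {j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")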
Relationship to nearest neighbors

A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.^([24]) It turns out that both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set $\{(x_i, y_i)\}_{i=1}^{n}$ that make predictions $\hat{y}$ for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

$\hat{y} = \sum_{i=1}^{n} W(x_i, x') \, y_i.$

Here, $W(x_i, x')$ is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any particular x', the weights for points $x_i$ must sum to one. Weight functions are given as follows:

- In k-NN, the weights are $W(x_i, x') = \frac{1}{k}$ if $x_i$ is one of the k points closest to x', and zero otherwise.
- In a tree, $W(x_i, x') = \frac{1}{k'}$ if $x_i$ is one of the k' points in the same leaf as x', and zero otherwise.

Since a forest averages the predictions of a set of m trees with individual weight functions $W_j$, its predictions are

$\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x') \, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.$

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points $x_i$ sharing the same leaf in any tree $j$. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.^([24])
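These weights can be read off a fitted forest directly. The sketch below (the helper name is invented) recovers W from leaf co-membership using scikit-learn's apply method; bootstrap=False is set so that each leaf's prediction is exactly the mean of the training responses in that leaf, making the weighted sum match forest.predict:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=5, random_state=0)
    forest = RandomForestRegressor(n_estimators=100, max_features=1 / 3,
                                   bootstrap=False, random_state=0).fit(X, y)

    def neighborhood_weights(forest, X_train, x_new):
        train_leaves = forest.apply(X_train)               # leaf index of each point, per tree
        new_leaves = forest.apply(x_new.reshape(1, -1))
        W = np.zeros(len(X_train))
        for j in range(train_leaves.shape[1]):
            in_leaf = train_leaves[:, j] == new_leaves[0, j]
            W[in_leaf] += 1.0 / in_leaf.sum()              # W_j(x_i, x') = 1/k' inside the leaf
        return W / train_leaves.shape[1]                   # average over the m trees

    w = neighborhood_weights(forest, X, X[0])
    print(w.sum())                         # 1.0: the weights sum to one
    print(w @ y, forest.predict(X[:1]))    # identical up to floating point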
Unsupervised learning with random forests

As part of their construction, random forest predictors naturally lead to a dissimilarity measure among the observations. One can also define a random forest dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the "observed" data from suitably generated synthetic data.^([9][25]) The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. A random forest dissimilarity can be attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data^([26]) and for label-free detection of proteins.^([27])

Variants

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.^([5][28][29]) In cases where the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.^([30][5])

Kernel random forest

In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.^([31])

History

Leo Breiman^([32]) was the first person to notice the link between random forests and kernel methods. He pointed out that random forests which are grown using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon^([33]) established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani^([34]) proposed the Random Forest Kernel and showed that it can empirically outperform state-of-the-art kernel methods. Scornet^([31]) first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on the centered random forest^([35]) and the uniform random forest,^([36]) two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.

Notations and definitions

Preliminaries: Centered forests

The centered forest^([35]) is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level $k$ is built, where $k \in \mathbb{N}$ is a parameter of the algorithm.

Uniform forest

The uniform forest^([36]) is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.

From random forest to KeRF

Given a training sample $\mathcal{D}_n = \{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}$ of $[0,1]^p \times \mathbb{R}$-valued independent random variables distributed as the independent prototype pair $(\mathbf{X}, Y)$, where $E[Y^2] < \infty$, we aim at predicting the response $Y$, associated with the random variable $\mathbf{X}$, by estimating the regression function $m(\mathbf{x}) = E[Y \mid \mathbf{X} = \mathbf{x}]$. A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(\mathbf{x}, \Theta_j)$ the predicted value at point $\mathbf{x}$ by the $j$-th tree, where $\Theta_1, \ldots, \Theta_M$ are independent random variables, distributed as a generic random variable $\Theta$, independent of the sample $\mathcal{D}_n$. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction.
The trees are combined to form the finite forest estimate

$m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}, \Theta_j).$

For regression trees, we have

$m_n = \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)},$

where $A_n(\mathbf{x}, \Theta_j)$ is the cell containing $\mathbf{x}$, designed with randomness $\Theta_j$ and dataset $\mathcal{D}_n$, and

$N_n(\mathbf{x}, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}.$

Thus random forest estimates satisfy, for all $\mathbf{x} \in [0,1]^d$,

$m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)}}{N_n(\mathbf{x}, \Theta_j)} \right).$

A random regression forest has two levels of averaging: first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than those of observations which belong to less populated cells. In order to improve the random forest methods and compensate for the misestimation, Scornet^([31]) defined KeRF by

$\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(\mathbf{x}, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \mathbf{1}_{\mathbf{X}_i \in A_n(\mathbf{x}, \Theta_j)},$

which is equal to the mean of the $Y_i$'s falling in the cells containing $\mathbf{x}$ in the forest. If we define the connection function of the $M$ finite forest as

$K_{M,n}(\mathbf{x}, \mathbf{z}) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{\mathbf{z} \in A_n(\mathbf{x}, \Theta_j)},$

i.e. the proportion of cells shared between $\mathbf{x}$ and $\mathbf{z}$,
then almost surely we have

$\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i K_{M,n}(\mathbf{x}, \mathbf{x}_i)}{\sum_{\ell=1}^{n} K_{M,n}(\mathbf{x}, \mathbf{x}_\ell)},$

which defines the KeRF.

Centered KeRF

The construction of Centered KeRF of level $k$ is the same as for the centered forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{k_1 + \cdots + k_d = k} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^k \prod_{j=1}^{d} \mathbf{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}, \quad \text{for all } \mathbf{x}, \mathbf{z} \in [0,1]^d.$

Uniform KeRF

Uniform KeRF is built in the same way as the uniform forest, except that predictions are made by $\tilde{m}_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M)$; the corresponding kernel function, or connection function, is

$K_k^{uf}(\mathbf{0}, \mathbf{x}) = \sum_{k_1 + \cdots + k_d = k} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^k \prod_{m=1}^{d} \left( 1 - |x_m| \sum_{j=0}^{k_m - 1} \frac{(-\ln |x_m|)^j}{j!} \right), \quad \text{for all } \mathbf{x} \in [0,1]^d.$
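Since these connection functions are explicit, they can be evaluated directly. Here is a small Python sketch of the Centered KeRF kernel that enumerates the multinomial sum (the function name is illustrative, and the approach is only practical for small k and d):

    from itertools import product
    from math import ceil, factorial

    def centered_kerf_kernel(x, z, k):
        # K_k^cc(x, z): sum over all (k_1, ..., k_d) with k_1 + ... + k_d = k of
        # the multinomial weight times the same-dyadic-cell indicators.
        d = len(x)
        total = 0.0
        for ks in product(range(k + 1), repeat=d):
            if sum(ks) != k:
                continue
            weight = factorial(k)
            for kj in ks:
                weight //= factorial(kj)
            if all(ceil(2 ** kj * xj) == ceil(2 ** kj * zj)
                   for kj, xj, zj in zip(ks, x, z)):
                total += weight * (1.0 / d) ** k
        return total

    # Nearby points share more dyadic cells, hence a larger kernel value:
    print(centered_kerf_kernel([0.3, 0.7], [0.32, 0.68], k=3))
    print(centered_kerf_kernel([0.3, 0.7], [0.9, 0.1], k=3))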
Properties

Relation between KeRF and random forest

Predictions given by KeRF and random forests are close if the number of points in each cell is controlled. Assume that there exist sequences $(a_n), (b_n)$ such that, almost surely,

$a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \quad \text{and} \quad a_n \leq \frac{1}{M} \sum_{m=1}^{M} N_n(\mathbf{x}, \Theta_m) \leq b_n.$

Then almost surely,

$|m_{M,n}(\mathbf{x}) - \tilde{m}_{M,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \tilde{m}_{M,n}(\mathbf{x}).$

Relation between infinite KeRF and infinite random forest

When the number of trees $M$ goes to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded. Assume that there exist sequences $(\varepsilon_n), (a_n), (b_n)$ such that, almost surely,

- $E[N_n(\mathbf{x}, \Theta)] \geq 1$,
- $P[a_n \leq N_n(\mathbf{x}, \Theta) \leq b_n \mid \mathcal{D}_n] \geq 1 - \varepsilon_n / 2$,
- $P[a_n \leq E_\Theta[N_n(\mathbf{x}, \Theta)] \leq b_n \mid \mathcal{D}_n] \geq 1 - \varepsilon_n / 2$.

Then almost surely,

$|m_{\infty,n}(\mathbf{x}) - \tilde{m}_{\infty,n}(\mathbf{x})| \leq \frac{b_n - a_n}{a_n} \tilde{m}_{\infty,n}(\mathbf{x}) + n \varepsilon_n \left( \max_{1 \leq i \leq n} Y_i \right).$

Consistency results

Assume that $Y = m(\mathbf{X}) + \varepsilon$, where $\varepsilon$ is a centered Gaussian noise, independent of $\mathbf{X}$, with finite variance $\sigma^2 < \infty$. Moreover, $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $m$ is Lipschitz. Scornet^([31]) proved upper bounds on the rates of consistency for centered KeRF and uniform KeRF.

Consistency of centered KeRF

Providing $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, there exists a constant $C_1 > 0$ such that, for all $n$,

$E[\tilde{m}_n^{cc}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C_1 n^{-1/(3 + d \log 2)} (\log n)^2.$

Consistency of uniform KeRF

Providing $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, there exists a constant $C > 0$ such that

$E[\tilde{m}_n^{uf}(\mathbf{X}) - m(\mathbf{X})]^2 \leq C n^{-2/(6 + 3d \log 2)} (\log n)^2.$

Disadvantages

While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable, along with linear models, rule-based models, and attention-based models. This interpretability is one of the most desirable qualities of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model.^([5][3]) For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder.
To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.^([5][37][38]) If it is established that the predictive attributes are linearly correlated with the target variable, using random forest may not enhance the accuracy of the base learner.^([5][30]) Furthermore, in problems with multiple categorical variables, random forest may not be able to increase the accuracy of the base learner.^([39])

See also

- Boosting - method in machine learning
- Decision tree learning - machine learning algorithm
- Ensemble learning - statistics and machine learning technique
- Gradient boosting - machine learning technique
- Non-parametric statistics - branch of statistics that is not based solely on parametrized families of probability distributions
- Randomized algorithm - algorithm that employs a degree of randomness as part of its logic or procedure

References

1. ^(a b c d) Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14-16 August 1995. pp. 278-282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
2. ^(a b c d) Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832-844. doi:10.1109/34.709601.
3. ^(a b c d e f g) Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
4. ^(a b) Piryonesi, S. Madeh; El-Diraby, Tamer E. (2020-06-01). "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems". Journal of Transportation Engineering, Part B: Pavements. 146 (2): 04020022. doi:10.1061/JPEODX.0000175. S2CID 216485629.
5. ^(a b c d e f) Piryonesi, S. Madeh; El-Diraby, Tamer E. (2021-02-01). "Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling". Journal of Infrastructure Systems. 27 (2): 04021005. doi:10.1061/(ASCE)IS.1943-555X.0000602. ISSN 1076-0342. S2CID 233550030.
6. ^(a b) Kleinberg, Eugene (1990). "Stochastic Discrimination" (PDF). Annals of Mathematics and Artificial Intelligence. 1 (1-4): 207-239. CiteSeerX 10.1.1.25.6750. doi:10.1007/BF01531079. S2CID 206795835. Archived from the original (PDF) on 2018-01-18.
7. ^(a b) Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition". Annals of Statistics. 24 (6): 2319-2349. doi:10.1214/aos/1032181157. MR 1425956.
8. ^(a b) Kleinberg, Eugene (2000). "On the Algorithmic Implementation of Stochastic Discrimination" (PDF). IEEE Transactions on PAMI. 22 (5): 473-490. CiteSeerX 10.1.1.33.4131. doi:10.1109/34.857004. S2CID 3563126. Archived from the original (PDF) on 2018-01-18.
9. ^(a b c d) Breiman, Leo (2001). "Random Forests". Machine Learning. 45 (1): 5-32. Bibcode:2001MachL..45....5B. doi:10.1023/A:1010933404324.
10. ^(a b) Liaw, Andy (16 October 2012). "Documentation for R package randomForest" (PDF). Retrieved 15 March 2013.
11. U.S. trademark registration number 3185828, registered 2006/12/19.
12. "RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks".
13. ^(a b) Amit, Yali; Geman, Donald (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation. 9 (7): 1545-1588. CiteSeerX 10.1.1.57.6069. doi:10.1162/neco.1997.9.7.1545. S2CID 12470146.
14. Dietterich, Thomas (2000). "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization". Machine Learning. 40 (2): 139-157. doi:10.1023/A:1007607513941.
15. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). An Introduction to Statistical Learning. Springer. pp. 316-321.
16. Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors" (PDF). Pattern Analysis and Applications. 5 (2): 102-112. doi:10.1007/s100440200009. S2CID 7415435.
17. Geurts, P.; Ernst, D.; Wehenkel, L. (2006). "Extremely randomized trees" (PDF). Machine Learning. 63: 3-42. doi:10.1007/s10994-006-6226-1.
18. Zhu, R.; Zeng, D.; Kosorok, M. R. (2015). "Reinforcement Learning Trees". Journal of the American Statistical Association. 110 (512): 1770-1784. doi:10.1080/01621459.2015.1036994. PMC 4760114. PMID 26903687.
19. Deng, H.; Runger, G.; Tuv, E. (2011). Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293-300.
20. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. (May 2010). "Permutation importance: a corrected feature importance measure". Bioinformatics. 26 (10): 1340-7. doi:10.1093/bioinformatics/btq134. PMID 20385727.
21. Strobl, C.; Boulesteix, A.; Augustin, T. (2007). "Unbiased split selection for classification trees based on the Gini index" (PDF). Computational Statistics & Data Analysis. 52: 483-501. CiteSeerX 10.1.1.525.3178. doi:10.1016/j.csda.2006.12.030.
22. Painsky, A.; Rosset, S. (2017). "Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance". IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (11): 2142-2153. arXiv:1512.03444. doi:10.1109/tpami.2016.2636831. PMID 28114007. S2CID 5381516.
23. Toloşi, L.; Lengauer, T. (July 2011). "Classification with correlated features: unreliability of feature ranking and solutions". Bioinformatics. 27 (14): 1986-94. doi:10.1093/bioinformatics/btr300. PMID 21576180.
24. ^(a b) Lin, Yi; Jeon, Yongho (2002). Random forests and adaptive nearest neighbors (Technical report). Technical Report No. 1055. University of Wisconsin. CiteSeerX 10.1.1.153.9168.
25. Shi, T.; Horvath, S. (2006). "Unsupervised Learning with Random Forest Predictors". Journal of Computational and Graphical Statistics. 15 (1): 118-138. CiteSeerX 10.1.1.698.2365. doi:10.1198/106186006X94072. JSTOR 27594168. S2CID 245216.
26. Shi, T.; Seligson, D.; Belldegrun, A. S.; Palotie, A.; Horvath, S. (April 2005). "Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma". Modern Pathology. 18 (4): 547-57. doi:10.1038/modpathol.3800322. PMID 15529185.
27. Dahmardeh, Mahyar; Mirzaalian Dastjerdi, Houman; Mazal, Hisham; Köstler, Harald; Sandoghdar, Vahid (2023-02-27). "Self-supervised machine learning pushes the sensitivity limit in label-free detection of single proteins below 10 kDa". Nature Methods. 20 (3): 442-447. doi:10.1038/s41592-023-01778-2. ISSN 1548-7105. PMID 36849549.
"Random Forests for multiclass classification: Random MultiNomial Logit". Expert Systems with Applications. 34 (3): 1721–1732. doi:10.1016/j.eswa.2007.01.029.{{cite journal}}: CS1 maint: uses authors parameter (link)29. ^ Prinzie, Anita (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". In Roland Wagner; Norman Revell; Günther Pernul (eds.). Database and Expert Systems Applications: 18th International Conference, DEXA 2007, Regensburg, Germany, September 3-7, 2007, Proceedings. Lecture Notes in Computer Science. Vol. 4653. pp. 349–358. doi:10.1007/978-3-540-74469-6_35. ISBN 978-3-540-74467-2.30. ^ ^(a) ^(b) Smith, Paul F.; Ganesh, Siva; Liu, Ping (2013-10-01). "A comparison of random forest regression and multiple linear regression for prediction in neuroscience". Journal of Neuroscience Methods. 220 (1): 85–91. doi:10.1016/j.jneumeth.2013.08.024. PMID 24012917. S2CID 13195700.31. ^ ^(a) ^(b) ^(c) ^(d) Scornet, Erwan (2015). "Random forests and kernel methods". arXiv:1502.03836 [math.ST].32. ^ Breiman, Leo (2000). "Some infinity theory for predictor ensembles". Technical Report 579, Statistics Dept. UCB. {{cite journal}}: Cite journal requires |journal= (help)33. ^ Lin, Yi; Jeon, Yongho (2006). "Random forests and adaptive nearest neighbors". Journal of the American Statistical Association. 101 (474): 578–590. CiteSeerX 10.1.1.153.9168. doi:10.1198/016214505000001230. S2CID 2469856.34. ^ Davies, Alex; Ghahramani, Zoubin (2014). "The Random Forest Kernel and other kernels for big data from random partitions". arXiv:1402.4293 [stat.ML].35. ^ ^(a) ^(b) Breiman L, Ghahramani Z (2004). "Consistency for a simple model of random forests". Statistical Department, University of California at Berkeley. Technical Report (670). CiteSeerX 10.1.1.618.90.36. ^ ^(a) ^(b) Arlot S, Genuer R (2014). "Analysis of purely random forests bias". arXiv:1407.3939 [math.ST].37. ^ Sagi, Omer; Rokach, Lior (2020). "Explainable decision forest: Transforming a decision forest into an interpretable tree". Information Fusion. 61: 124–138. doi:10.1016/j.inffus.2020.03.013. S2CID 216444882.38. ^ Vidal, Thibaut; Schiffer, Maximilian (2020). "Born-Again Tree Ensembles". International Conference on Machine Learning. PMLR. 119: 9743–9753. arXiv:2003.11132.39. ^ Piryonesi, Sayed Madeh (November 2019). Piryonesi, S. M. (2019). The Application of Data Analytics to Asset Management: Deterioration and Climate Change Adaptation in Ontario Roads (Doctoral dissertation) (Thesis).Further reading[edit][]Scholia has a topic profile for Random forest.- Prinzie A, Poel D (2007). "Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 4653. p. 349. doi:10.1007/978-3-540-74469-6_35. ISBN 978-3-540-74467-2.- Denisko D, Hoffman MM (February 2018). "Classification and interaction in random forests". Proceedings of the National Academy of Sciences of the United States of America. 115 (8): 1690–1692. Bibcode:2018PNAS..115.1690D. doi:10.1073/pnas.1800256115. PMC 5828645. PMID 29440440.External links[edit]- Random Forests classifier description (Leo Breiman's site)- Liaw, Andy & Wiener, Matthew "Classification and Regression by randomForest" R News (2002) Vol. 2/3 p. 
Statistical modeling method

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.^([1]) This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.^([2])

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data.
Such models are called linear models.^([3]) Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.^([4]) This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

- If the goal is error reduction in prediction or forecasting, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
- If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function, as in ridge regression (L²-norm penalty) and lasso (L¹-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.

Formulation

[Figure: In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).]

Given a data set {y_(i), x_(i1), …, x_(ip)}_(i=1)^(n) of n statistical units, a linear regression model assumes that the relationship between the dependent variable y and the vector of regressors x is linear. This relationship is modeled through a disturbance term or error variable ε, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and the regressors.
Thus the model takes the form

$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n,$

where ^(T) denotes the transpose, so that x_(i)^(T)β is the inner product between the vectors x_(i) and β.

Often these n equations are stacked together and written in matrix notation as

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$

where

$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\mathsf{T} \\ \mathbf{x}_2^\mathsf{T} \\ \vdots \\ \mathbf{x}_n^\mathsf{T} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$

Notation and terminology

- y is a vector of observed values y_(i) (i = 1, …, n) of the variable called the regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable. This variable is also sometimes known as the predicted variable, but this should not be confused with predicted values, which are denoted ŷ. The decision as to which variable in a data set is modeled as the dependent variable and which are modeled as the independent variables may be based on a presumption that the value of one of the variables is caused by, or directly influenced by, the other variables. Alternatively, there may be an operational reason to model one of the variables in terms of the others, in which case there need be no presumption of causality.
- X may be seen as a matrix of row-vectors x_(i⋅) or of n-dimensional column-vectors x_(⋅j), which are known as regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables (not to be confused with the concept of independent random variables). The matrix X is sometimes called the design matrix.
  - Usually a constant is included as one of the regressors. In particular, x_(i0) = 1 for i = 1, …, n. The corresponding element of β is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero.
  - Sometimes one of the regressors can be a non-linear function of another regressor or of the data, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector β.
  - The values x_(ij) may be viewed as either observed values of random variables X_(j) or as fixed values chosen prior to observing the dependent variable. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however, different approaches to asymptotic analysis are used in these two situations.
- β is a (p+1)-dimensional parameter vector, where β₀ is the intercept term (if one is included in the model; otherwise β is p-dimensional). Its elements are known as effects or regression coefficients (although the latter term is sometimes reserved for the estimated effects). In simple linear regression, p = 1, and the coefficient is known as the regression slope. Statistical estimation and inference in linear regression focuses on β. The elements of this parameter vector are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.
- ε is a vector of values ε_(i). This part of the model is called the error term, disturbance term, or sometimes noise (in contrast with the "signal" provided by the rest of the model). This variable captures all other factors which influence the dependent variable y other than the regressors x. The relationship between the error term and the regressors, for example their correlation, is a crucial consideration in formulating a linear regression model, as it will determine the appropriate estimation method.

Fitting a linear model to a given data set usually requires estimating the regression coefficients β such that the error term ε = y − Xβ is minimized. For example, it is common to use the sum of squared errors ∥ε∥₂² as a measure of ε for minimization.

Example

Consider a situation where a small ball is being tossed up in the air and then we measure its heights of ascent h_(i) at various moments in time t_(i). Physics tells us that, ignoring the drag, the relationship can be modeled as

$h_i = \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i,$

where β₁ determines the initial velocity of the ball, β₂ is proportional to the standard gravity, and ε_(i) is due to measurement errors. Linear regression can be used to estimate the values of β₁ and β₂ from the measured data. This model is non-linear in the time variable, but it is linear in the parameters β₁ and β₂; if we take regressors x_(i) = (x_(i1), x_(i2)) = (t_(i), t_(i)²), the model takes on the standard form

$h_i = \mathbf{x}_i^\mathsf{T} \boldsymbol{\beta} + \varepsilon_i.$
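The following is a minimal numerical sketch of this example in Python (assuming NumPy is available); the measurement times, noise level, and the "true" values 10.0 and −4.905 are hypothetical, chosen only to simulate data:

    # Fit h = b1*t + b2*t^2 by least squares: nonlinear in t, linear in (b1, b2).
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0.1, 2.0, 30)                  # measurement times (made up)
    h = 10.0 * t - 4.905 * t**2                    # v0 = 10, -g/2 = -4.905
    h = h + rng.normal(scale=0.05, size=t.size)    # add measurement noise

    X = np.column_stack([t, t**2])                 # regressors (t_i, t_i^2)
    beta, *_ = np.linalg.lstsq(X, h, rcond=None)   # least-squares (b1, b2)
    print(beta)                                    # roughly [10.0, -4.905]
    print(-2 * beta[1])                            # implied g, roughly 9.81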
Assumptions

See also: Ordinary least squares § Assumptions

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variables and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.

[Figure: Example of a cubic polynomial regression, which is a type of linear regression. Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.]

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):

- Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed values, rather than random variables. This means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement errors. Although this assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-in-variables models.
- Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
- Constant variance (a.k.a. homoscedasticity). This means that the variance of the errors does not depend on the values of the predictor variables. Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000 (i.e., a standard deviation of around $20,000), while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between −$10,000 and $30,000. (In fact, as this shows, in many cases, often the same cases where the assumption of normally distributed errors fails, the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature; a small numerical version of this check is sketched after this list. Formal tests can also be used; see Heteroscedasticity. The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The mean squared error for the model will also be wrong. Various estimation techniques, including weighted least squares and the use of heteroscedasticity-consistent standard errors, can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the logarithm of the response variable using a linear regression model, which implies that the response variable itself has a log-normal distribution rather than a normal distribution).

[Figure: To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations such as autocorrelation in the errors or their correlation with one or more covariates.]

- Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods, such as generalized least squares, are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.
- Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods, the design matrix X must have full column rank p; otherwise perfect multicollinearity exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. This can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the parameter vector β will be non-identifiable: it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space R^(p)). See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed,^([5][6][7][8]) some of which require additional assumptions such as "effect sparsity", that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem.
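As a complement to the graphical checks mentioned in the list above, the following Python sketch (assumptions: NumPy available, simulated data whose error spread grows with x) computes a crude numerical "fanning" indicator, the correlation between absolute residuals and fitted values; in practice a formal test such as Breusch–Pagan would also be used:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = rng.uniform(0, 10, n)
    y = 2.0 + 0.5 * x + rng.normal(scale=0.2 * (1 + x))   # error spread grows with x

    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS fit
    fitted = X @ beta
    resid = y - fitted

    # A clearly positive correlation between |residuals| and fitted values
    # suggests a fanning effect, i.e. heteroscedasticity.
    print(np.corrcoef(np.abs(resid), fitted)[0, 1])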
Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:

- The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties, such as being unbiased and consistent.
- The arrangement, or probability distribution, of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of β.

Interpretation

[Figure: The data sets in Anscombe's quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but are graphically very different. This illustrates the pitfalls of relying solely on a fitted model to understand the relationship between variables.]

A fitted linear regression model can be used to identify the relationship between a single predictor variable x_(j) and the response variable y when all the other predictor variables in the model are "held fixed". Specifically, the interpretation of β_(j) is the expected change in y for a one-unit change in x_(j) when the other covariates are held fixed, that is, the expected value of the partial derivative of y with respect to x_(j). This is sometimes called the unique effect of x_(j) on y. In contrast, the marginal effect of x_(j) on y can be assessed using a correlation coefficient or simple linear regression model relating only x_(j) to y; this effect is the total derivative of y with respect to x_(j).

Care must be taken when interpreting regression results, as some of the regressors may not allow for marginal changes (such as dummy variables, or the intercept term), while others cannot be held fixed (recall the example from the introduction: it would be impossible to "hold t_(i) fixed" and at the same time change the value of t_(i)²).

It is possible that the unique effect can be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all the information in x_(j), so that once that variable is in the model, there is no contribution of x_(j) to the variation in y. Conversely, the unique effect of x_(j) can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of y, but they mainly explain variation in a way that is complementary to what is captured by x_(j). In this case, including the other variables in the model reduces the part of the variability of y that is unrelated to x_(j), thereby strengthening the apparent relationship with x_(j).

The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study.

The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design.^([9])

Group effects

In a multiple linear regression model

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$

the parameter β_(j) of predictor variable x_(j) represents the individual effect of x_(j). It has an interpretation as the expected change in the response variable y when x_(j) increases by one unit with the other predictor variables held constant. When x_(j) is strongly correlated with other predictor variables, it is improbable that x_(j) can increase by one unit with other variables held constant.
In this case, the interpretation of β_(j) becomes problematic, as it is based on an improbable condition, and the effect of x_(j) cannot be evaluated in isolation.

For a group of predictor variables, say {x₁, x₂, …, x_(q)}, a group effect ξ(w) is defined as a linear combination of their parameters

$\xi(\mathbf{w}) = w_1 \beta_1 + w_2 \beta_2 + \cdots + w_q \beta_q,$

where w = (w₁, w₂, …, w_(q))^(⊺) is a weight vector satisfying $\sum_{j=1}^{q} |w_j| = 1$. Because of this constraint on the w_(j), ξ(w) is also referred to as a normalized group effect. A group effect ξ(w) has an interpretation as the expected change in y when the variables in the group x₁, x₂, …, x_(q) change by the amounts w₁, w₂, …, w_(q), respectively, at the same time, with variables not in the group held constant. It generalizes the individual effect of a variable to a group of variables in that (i) if q = 1, then the group effect reduces to an individual effect, and (ii) if w_(i) = 1 and w_(j) = 0 for j ≠ i, then the group effect also reduces to an individual effect. A group effect ξ(w) is said to be meaningful if the underlying simultaneous changes of the q variables given by (w₁, w₂, …, w_(q))^(⊺) are probable.

Group effects provide a means to study the collective impact of strongly correlated predictor variables in linear regression models. Individual effects of such variables are not well-defined, as their parameters do not have good interpretations. Furthermore, when the sample size is not large, none of their parameters can be accurately estimated by least squares regression due to the multicollinearity problem. Nevertheless, there are meaningful group effects that have good interpretations and can be accurately estimated by least squares regression. A simple way to identify these meaningful group effects is to use an all positive correlations (APC) arrangement of the strongly correlated variables, under which pairwise correlations among these variables are all positive, and to standardize all p predictor variables in the model so that they all have mean zero and length one. To illustrate this, suppose that {x₁, x₂, …, x_(q)} is a group of strongly correlated variables in an APC arrangement and that they are not strongly correlated with predictor variables outside the group. Let y′ be the centred y and x_(j)′ be the standardized x_(j). Then, the standardized linear regression model is

$y' = \beta_1' x_1' + \cdots + \beta_p' x_p' + \varepsilon.$

Parameters β_(j) in the original model, including β₀, are simple functions of the β_(j)′ in the standardized model.
The standardization of variables does not change their correlations, so {x₁′, x₂′, …, x_(q)′} is a group of strongly correlated variables in an APC arrangement, and they are not strongly correlated with other predictor variables in the standardized model. A group effect of {x₁′, x₂′, …, x_(q)′} is

$\xi'(\mathbf{w}) = w_1 \beta_1' + w_2 \beta_2' + \cdots + w_q \beta_q',$

and its minimum-variance unbiased linear estimator is

$\hat{\xi}'(\mathbf{w}) = w_1 \hat{\beta}_1' + w_2 \hat{\beta}_2' + \cdots + w_q \hat{\beta}_q',$

where β̂_(j)′ is the least squares estimator of β_(j)′. In particular, the average group effect of the q standardized variables is

$\xi_A = \frac{1}{q}(\beta_1' + \beta_2' + \cdots + \beta_q'),$

which has an interpretation as the expected change in y′ when all x_(j)′ in the strongly correlated group increase by (1/q)th of a unit at the same time, with variables outside the group held constant. With strong positive correlations and in standardized units, the variables in the group are approximately equal, so they are likely to increase at the same time and by similar amounts. Thus, the average group effect ξ_(A) is a meaningful effect. It can be accurately estimated by its minimum-variance unbiased linear estimator $\hat{\xi}_A = \frac{1}{q}(\hat{\beta}_1' + \hat{\beta}_2' + \cdots + \hat{\beta}_q')$, even when individually none of the β_(j)′ can be accurately estimated by β̂_(j)′.

Not all group effects are meaningful or can be accurately estimated. For example, β₁′ is a special group effect with weights w₁ = 1 and w_(j) = 0 for j ≠ 1, but it cannot be accurately estimated by β̂₁′. It is also not a meaningful effect. In general, for a group of q strongly correlated predictor variables in an APC arrangement in the standardized model, group effects whose weight vectors w are at or near the centre of the simplex $\sum_{j=1}^{q} w_j = 1$ (w_(j) ≥ 0) are meaningful and can be accurately estimated by their minimum-variance unbiased linear estimators. Effects with weight vectors far away from the centre are not meaningful, as such weight vectors represent simultaneous changes of the variables that violate the strong positive correlations of the standardized variables in an APC arrangement. As such, they are not probable. These effects also cannot be accurately estimated.

Applications of the group effects include (1) estimation and inference for meaningful group effects on the response variable, (2) testing for "group significance" of the q variables via testing H₀ : ξ_(A) = 0 versus H₁ : ξ_(A) ≠ 0, and (3) characterizing the region of the predictor variable space over which predictions by the least squares estimated model are accurate.

A group effect of the original variables {x₁, x₂, …, x_(q)} can be expressed as a constant times a group effect of the standardized variables {x₁′, x₂′, …, x_(q)′}. The former is meaningful when the latter is. Thus meaningful group effects of the original variables can be found through meaningful group effects of the standardized variables.^([10])
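The following Python sketch (assuming NumPy; the five strongly positively correlated predictors are simulated, and the "true" coefficients are all 1) illustrates the contrast: the individual least squares estimates fluctuate wildly under multicollinearity, while their average, the estimate of ξ_(A), is stable:

    import numpy as np

    rng = np.random.default_rng(2)
    n, q = 50, 5
    z = rng.normal(size=n)
    X = z[:, None] + 0.1 * rng.normal(size=(n, q))   # strongly correlated columns
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)                # mean zero, length one
    y = X.sum(axis=1) + 0.1 * rng.normal(size=n)     # true coefficients all equal 1
    y = y - y.mean()                                 # centred response

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)          # individual estimates: unstable under multicollinearity
    print(beta_hat.mean())   # average group effect estimate: close to 1, stable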
Extensions

Numerous extensions of linear regression have been developed, which allow some or all of the assumptions underlying the basic model to be relaxed.

Simple and multiple linear regression

[Figure: Example of simple linear regression, which has one independent variable.]

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression (not to be confused with multivariate linear regression^([11])).

Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + \varepsilon_i$

for each observation i = 1, …, n.

In the formula above we consider n observations of one dependent variable and p independent variables. Thus, Y_(i) is the i-th observation of the dependent variable, X_(ij) is the i-th observation of the j-th independent variable, for j = 1, 2, …, p. The values β_(j) represent parameters to be estimated, and ε_(i) is the i-th independent, identically distributed normal error.

In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share the same set of explanatory variables and hence are estimated simultaneously with each other:

$Y_{ij} = \beta_{0j} + \beta_{1j} X_{i1} + \beta_{2j} X_{i2} + \ldots + \beta_{pj} X_{ip} + \varepsilon_{ij}$

for all observations indexed as i = 1, …, n and for all dependent variables indexed as j = 1, …, m.

Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar.
Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression.

General linear models

The general linear model considers the situation when the response variable is not a scalar (for each observation) but a vector, y_(i). Conditional linearity of E(y ∣ x_(i)) = x_(i)^(T)B is still assumed, with a matrix B replacing the vector β of the classical linear regression model. Multivariate analogues of ordinary least squares (OLS) and generalized least squares (GLS) have been developed. "General linear models" are also called "multivariate linear models". These are not the same as multivariable linear models (also called "multiple linear models").

Heteroscedastic models

Various models have been created that allow for heteroscedasticity, i.e. the errors for different response variables may have different variances. For example, weighted least squares is a method for estimating linear regression models when the response variables may have different error variances, possibly with correlated errors. (See also Weighted linear least squares, and Generalized least squares.) Heteroscedasticity-consistent standard errors is an improved method for use with uncorrelated but potentially heteroscedastic errors.

Generalized linear models

Generalized linear models (GLMs) are a framework for modeling response variables that are bounded or discrete. This is used, for example:

- when modeling positive quantities (e.g. prices or populations) that vary over a large scale, which are better described using a skewed distribution such as the log-normal distribution or Poisson distribution (although GLMs are not used for log-normal data; instead the response variable is simply transformed using the logarithm function);
- when modeling categorical data, such as the choice of a given candidate in an election (which is better described using a Bernoulli distribution/binomial distribution for binary choices, or a categorical distribution/multinomial distribution for multi-way choices), where there are a fixed number of choices that cannot be meaningfully ordered;
- when modeling ordinal data, e.g. ratings on a scale from 0 to 5, where the different outcomes can be ordered but where the quantity itself may not have any absolute meaning (e.g. a rating of 4 may not be "twice as good" in any objective sense as a rating of 2, but simply indicates that it is better than 2 or 3 but not as good as 5).

Generalized linear models allow for an arbitrary link function, g, that relates the mean of the response variable(s) to the predictors: E(Y) = g⁻¹(XB). The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between the (−∞, ∞) range of the linear predictor and the range of the response variable.

Some common examples of GLMs are:

- Poisson regression for count data.
- Logistic regression and probit regression for binary data.
- Multinomial logistic regression and multinomial probit regression for categorical data.
- Ordered logit and ordered probit regression for ordinal data.
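As an illustration of the link-function idea, the sketch below (Python with NumPy; the data are simulated, and the Newton/IRLS loop is written out by hand rather than taken from a statistics library) fits a Poisson regression with the canonical log link, so that E(y) = exp(Xβ):

    import numpy as np

    def fit_poisson_glm(X, y, n_iter=25):
        """Poisson regression with log link, fitted by Newton's method (IRLS)."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = np.exp(X @ beta)              # inverse link: E(y) = exp(X beta)
            grad = X.T @ (y - mu)              # score of the Poisson log-likelihood
            hess = X.T @ (X * mu[:, None])     # Fisher information (weights W = mu)
            beta = beta + np.linalg.solve(hess, grad)
        return beta

    rng = np.random.default_rng(3)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))   # hypothetical true (0.5, 0.8)
    print(fit_poisson_glm(X, y))                        # roughly [0.5, 0.8]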
Single index models^([clarification needed]) allow some degree of nonlinearity in the relationship between x and y, while preserving the central role of the linear predictor β′x as in the classical linear regression model. Under certain conditions, simply applying OLS to data from a single-index model will consistently estimate β up to a proportionality constant.^([12])

Hierarchical linear models

Hierarchical linear models (or multilevel regression) organize the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. They are often used where the variables of interest have a natural hierarchical structure, such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.

Errors-in-variables

Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero.
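A short simulation (Python with NumPy; all numbers hypothetical) makes the attenuation visible: with equal variances for the true predictor and its measurement error, the OLS slope shrinks by a factor of about one half:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    x_true = rng.normal(size=n)
    y = 2.0 * x_true + rng.normal(scale=0.5, size=n)
    x_obs = x_true + rng.normal(size=n)            # predictor observed with error

    slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    print(slope)   # about 1.0 = 2.0 * var(x_true) / (var(x_true) + var(error))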
Others

- In Dempster–Shafer theory, or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models.

Estimation methods

A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.

Some of the more common estimation techniques for linear regression are summarized below.

Least-squares estimation and related techniques

Main article: Linear least squares

[Figure: Francis Galton's 1886^([13]) illustration of the correlation between the heights of adults and their parents. The observation that adult children's heights tended to deviate less from the mean height than their parents' suggested the concept of "regression toward the mean", giving regression its name. The "locus of horizontal tangential points" passing through the leftmost and rightmost points on the ellipse (which is a level curve of the bivariate normal distribution estimated from the data) is the OLS estimate of the regression of parents' heights on children's heights, while the "locus of vertical tangential points" is the OLS estimate of the regression of children's heights on parents' heights. The major axis of the ellipse is the TLS estimate.]

Assuming that the independent variable is $\vec{x_i} = \left[x_1^i, x_2^i, \ldots, x_m^i\right]$ and the model's parameters are $\vec{\beta} = \left[\beta_0, \beta_1, \ldots, \beta_m\right]$, the model's prediction is

$y_i \approx \beta_0 + \sum_{j=1}^{m} \beta_j x_j^i.$

If $\vec{x_i}$ is extended to $\vec{x_i} = \left[1, x_1^i, x_2^i, \ldots, x_m^i\right]$, then y_(i) becomes a dot product of the parameter vector and the extended independent variable, i.e.

$y_i \approx \sum_{j=0}^{m} \beta_j x_j^i = \vec{\beta} \cdot \vec{x_i}.$

In the least-squares setting, the optimum parameter vector is defined as the one that minimizes the sum of squared losses:

$\vec{\hat{\beta}} = \underset{\vec{\beta}}{\arg\min}\, L\left(D, \vec{\beta}\right) = \underset{\vec{\beta}}{\arg\min} \sum_{i=1}^{n} \left(\vec{\beta} \cdot \vec{x_i} - y_i\right)^2.$

Now putting the independent and dependent variables in matrices X and Y respectively, the loss function can be rewritten as

$L\left(D, \vec{\beta}\right) = \|X\vec{\beta} - Y\|^2 = \left(X\vec{\beta} - Y\right)^\mathsf{T}\left(X\vec{\beta} - Y\right) = Y^\mathsf{T}Y - Y^\mathsf{T}X\vec{\beta} - \vec{\beta}^\mathsf{T}X^\mathsf{T}Y + \vec{\beta}^\mathsf{T}X^\mathsf{T}X\vec{\beta}.$

As the loss is convex, the optimum solution lies at gradient zero.
The gradient of the loss function is (using the denominator-layout convention):

$\frac{\partial L\left(D, \vec{\beta}\right)}{\partial \vec{\beta}} = \frac{\partial \left(Y^\mathsf{T}Y - Y^\mathsf{T}X\vec{\beta} - \vec{\beta}^\mathsf{T}X^\mathsf{T}Y + \vec{\beta}^\mathsf{T}X^\mathsf{T}X\vec{\beta}\right)}{\partial \vec{\beta}} = -2X^\mathsf{T}Y + 2X^\mathsf{T}X\vec{\beta}.$

Setting the gradient to zero produces the optimum parameter:

$-2X^\mathsf{T}Y + 2X^\mathsf{T}X\vec{\beta} = 0 \quad\Rightarrow\quad X^\mathsf{T}X\vec{\beta} = X^\mathsf{T}Y \quad\Rightarrow\quad \vec{\hat{\beta}} = \left(X^\mathsf{T}X\right)^{-1}X^\mathsf{T}Y.$

Note: to prove that the β̂ obtained is indeed a local minimum, one needs to differentiate once more to obtain the Hessian matrix and show that it is positive definite; this is provided by the Gauss–Markov theorem. (A numerical sketch of this closed-form solution appears after the following list.)

Linear least squares methods include mainly:

- Ordinary least squares
- Weighted least squares
- Generalized least squares
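A minimal Python sketch of the closed-form solution derived above (assuming NumPy; the data are simulated) solves the normal equations XᵀXβ = XᵀY directly; in practice np.linalg.lstsq is often preferred for ill-conditioned design matrices:

    import numpy as np

    def ols(X, y):
        # Solve the normal equations (X^T X) beta = X^T y without an explicit inverse.
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(5)
    X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
    y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
    print(ols(X, y))   # close to the simulated truth [1, 2, -3]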
Maximum-likelihood estimation and related techniques

- Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family f_(θ) of probability distributions.^([14]) When f_(θ) is a normal distribution with zero mean and variance θ, the resulting estimate is identical to the OLS estimate. GLS estimates are maximum likelihood estimates when ε follows a multivariate normal distribution with a known covariance matrix.
- Ridge regression^([15][16][17]) and other forms of penalized estimation, such as lasso regression,^([5]) deliberately introduce bias into the estimation of β in order to reduce the variability of the estimate. The resulting estimates generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present or when overfitting is a problem. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.
- Least absolute deviation (LAD) regression is a robust estimation technique in that it is less sensitive to the presence of outliers than OLS (but is less efficient than OLS when no outliers are present). It is equivalent to maximum likelihood estimation under a Laplace distribution model for ε.^([18])
- Adaptive estimation. If we assume that the error terms are independent of the regressors, ε_(i) ⊥ x_(i), then the optimal estimator is the 2-step MLE, where the first step is used to non-parametrically estimate the distribution of the error term.^([19])

Other estimation techniques

[Figure: Comparison of the Theil–Sen estimator (black) and simple linear regression (blue) for a set of points with outliers.]

- Bayesian linear regression applies the framework of Bayesian statistics to linear regression. (See also Bayesian multivariate linear regression.) In particular, the regression coefficients β are assumed to be random variables with a specified prior distribution. The prior distribution can bias the solutions for the regression coefficients, in a way similar to (but more general than) ridge regression or lasso regression. In addition, the Bayesian estimation process produces not a single point estimate for the "best" values of the regression coefficients but an entire posterior distribution, completely describing the uncertainty surrounding the quantity. This can be used to estimate the "best" coefficients using the mean, mode, median, any quantile (see quantile regression), or any other function of the posterior distribution.
- Quantile regression focuses on the conditional quantiles of y given X rather than the conditional mean of y given X. Linear quantile regression models a particular conditional quantile, for example the conditional median, as a linear function β^(T)x of the predictors.
- Mixed models are widely used to analyze linear regression relationships involving dependent data when the dependencies have a known structure. Common applications of mixed models include analysis of data involving repeated measurements, such as longitudinal data, or data obtained from cluster sampling. They are generally fit as parametric models, using maximum likelihood or Bayesian estimation. In the case where the errors are modeled as normal random variables, there is a close connection between mixed models and generalized least squares.^([20]) Fixed effects estimation is an alternative approach to analyzing this type of data.
- Principal component regression (PCR)^([7][8]) is used when the number of predictor variables is large, or when strong correlations exist among the predictor variables. This two-stage procedure first reduces the predictor variables using principal component analysis, and then uses the reduced variables in an OLS regression fit. While it often works well in practice, there is no general theoretical reason that the most informative linear function of the predictor variables should lie among the dominant principal components of the multivariate distribution of the predictor variables. Partial least squares regression is an extension of the PCR method that does not suffer from this deficiency.
- Least-angle regression^([6]) is an estimation procedure for linear regression models that was developed to handle high-dimensional covariate vectors, potentially with more covariates than observations.
- The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the fit line to be the median of the slopes of the lines through pairs of sample points (a minimal sketch follows this list). It has similar statistical efficiency properties to simple linear regression but is much less sensitive to outliers.^([21])
- Other robust estimation techniques, including the α-trimmed mean approach^([citation needed]) and L-, M-, S-, and R-estimators, have been introduced.^([citation needed])
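A minimal sketch of the Theil–Sen estimator referenced above (Python with NumPy; the six data points are made up, the last one an outlier):

    import numpy as np
    from itertools import combinations

    def theil_sen(x, y):
        # Slope = median of slopes over all pairs; one common intercept choice follows.
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
        slope = np.median(slopes)
        return slope, np.median(y - slope * x)

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 20.0])   # last point is an outlier
    print(theil_sen(x, y))                          # slope stays near 1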
Applications

See also: Linear least squares § Applications

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

Trend line

Main article: Trend estimation

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
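A trend line of this kind can be computed in one step; the sketch below (Python with NumPy; the yearly series is invented purely for illustration) fits a degree-1 polynomial, i.e. a straight line:

    import numpy as np

    years = np.arange(2000, 2010)
    series = np.array([3.2, 3.4, 3.3, 3.6, 3.8, 3.9, 4.1, 4.0, 4.3, 4.5])  # made up

    slope, intercept = np.polyfit(years, series, 1)   # least-squares trend line
    print(f"trend: {slope:+.3f} per year")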
Epidemiology

Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data, researchers usually include several variables in their regression models in addition to the variable of primary interest. For example, in a regression model in which cigarette smoking is the independent variable of primary interest and the dependent variable is lifespan measured in years, researchers might include education and income as additional independent variables, to ensure that any observed effect of smoking on lifespan is not due to those other socio-economic factors. However, it is never possible to include all possible confounding variables in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than can be obtained using regression analyses of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.

Finance

The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Economics

Main article: Econometrics

Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending,^([22]) fixed investment spending, inventory investment, purchases of a country's exports,^([23]) spending on imports,^([23]) the demand to hold liquid assets,^([24]) labor demand,^([25]) and labor supply.^([25])

Environmental science

Linear regression finds application in a wide range of environmental science applications. In Canada, the Environmental Effects Monitoring Program uses statistical analyses on fish and benthic surveys to measure the effects of pulp mill or metal mine effluent on the aquatic ecosystem.^([26])

Machine learning

Linear regression plays an important role in the subfield of artificial intelligence known as machine learning. The linear regression algorithm is one of the fundamental supervised machine-learning algorithms, due to its relative simplicity and well-known properties.^([27])

History

Least squares linear regression, as a means of finding a good rough linear fit to a set of points, was performed by Legendre (1805) and Gauss (1809) for the prediction of planetary movement. Quetelet was responsible for making the procedure well-known and for using it extensively in the social sciences.^([28])

See also

- Analysis of variance
- Blinder–Oaxaca decomposition
- Censored regression model
- Cross-sectional regression
- Curve fitting
- Deming regression
- Empirical Bayes method
- Errors and residuals
- Lack-of-fit sum of squares
- Line fitting
- Linear classifier
- Linear equation
- Logistic regression
- M-estimator
- Multivariate adaptive regression spline
- Nonlinear regression
- Nonparametric regression
- Normal equations
- Projection pursuit regression
- Response modeling methodology
- Segmented linear regression
- Standard deviation line
- Stepwise regression
- Structural break
- Support vector machine
- Truncated regression model

References

Citations

1. Freedman, David A. (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26. "A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient."
2. Rencher, Alvin C.; Christensen, William F. (2012). "Chapter 10, Multivariate regression – Section 10.1, Introduction". Methods of Multivariate Analysis. Wiley Series in Probability and Statistics. Vol. 709 (3rd ed.). John Wiley & Sons. p. 19. ISBN 9781118391679.
3. Seal, Hilary L. (1967). "The historical development of the Gauss linear model". Biometrika. 54 (1/2): 1–24. doi:10.1093/biomet/54.1-2.1. JSTOR 2333849.
4. Yan, Xin (2009). Linear Regression Analysis: Theory and Computing. World Scientific. pp. 1–2. ISBN 9789812834119. "Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun."
5. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B. 58 (1): 267–288. JSTOR 2346178.
6. Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". The Annals of Statistics. 32 (2): 407–451. arXiv:math/0406456. doi:10.1214/009053604000000067. JSTOR 3448465. S2CID 204004121.
7. Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". Journal of the Royal Statistical Society, Series C. 22 (3): 275–286. doi:10.2307/2346776. JSTOR 2346776.
8. Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C. 31 (3): 300–303. doi:10.2307/2348005. JSTOR 2348005.
9. Berk, Richard A. (2007). "Regression Analysis: A Constructive Critique". Criminal Justice Review. 32 (3): 301–302. doi:10.1177/0734016807304871. S2CID 145389362.
10. Tsao, Min (2022). "Group least squares regression for linear models with strongly correlated predictor variables". Annals of the Institute of Statistical Mathematics. 75 (2): 233–250. arXiv:1804.02499. doi:10.1007/s10463-022-00841-7. S2CID 237396158.
11. Hidalgo, Bertha; Goodman, Melody (2012-11-15). "Multivariate or Multivariable Regression?". American Journal of Public Health. 103 (1): 39–40. doi:10.2105/AJPH.2012.300897. ISSN 0090-0036. PMC 3518362. PMID 23153131.
12. Brillinger, David R. (1977). "The Identification of a Particular Nonlinear Time Series System". Biometrika. 64 (3): 509–515. doi:10.1093/biomet/64.3.509. JSTOR 2345326.
13. Galton, Francis (1886). "Regression Towards Mediocrity in Hereditary Stature". The Journal of the Anthropological Institute of Great Britain and Ireland. 15: 246–263. doi:10.2307/2841583. ISSN 0959-5295. JSTOR 2841583.
14. Lange, Kenneth L.; Little, Roderick J. A.; Taylor, Jeremy M. G. (1989). "Robust Statistical Modeling Using the t Distribution" (PDF). Journal of the American Statistical Association. 84 (408): 881–896. doi:10.2307/2290063. JSTOR 2290063.
15. Swindel, Benee F. (1981). "Geometry of Ridge Regression Illustrated". The American Statistician. 35 (1): 12–15. doi:10.2307/2683577. JSTOR 2683577.
16. Draper, Norman R.; Van Nostrand, R. Craig (1979). "Ridge Regression and James-Stein Estimation: Review and Comments". Technometrics. 21 (4): 451–466. doi:10.2307/1268284. JSTOR 1268284.
17. Hoerl, Arthur E.; Kennard, Robert W.; Hoerl, Roger W. (1985). "Practical Use of Ridge Regression: A Challenge Met". Journal of the Royal Statistical Society, Series C. 34 (2): 114–120. JSTOR 2347363.
18. Narula, Subhash C.; Wellington, John F. (1982). "The Minimum Sum of Absolute Errors Regression: A State of the Art Survey". International Statistical Review. 50 (3): 317–326. doi:10.2307/1402501. JSTOR 1402501.
19. Stone, C. J. (1975). "Adaptive maximum likelihood estimators of a location parameter". The Annals of Statistics. 3 (2): 267–284. doi:10.1214/aos/1176343056. JSTOR 2958945.
20. Goldstein, H. (1986). "Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least Squares". Biometrika. 73 (1): 43–56. doi:10.1093/biomet/73.1.43. JSTOR 2336270.
21. Theil, H. (1950). "A rank-invariant method of linear and polynomial regression analysis. I, II, III". Nederl. Akad. Wetensch., Proc. 53: 386–392, 521–525, 1397–1412. MR 0036489; Sen, Pranab Kumar (1968). "Estimates of the regression coefficient based on Kendall's tau". Journal of the American Statistical Association. 63 (324): 1379–1389. doi:10.2307/2285891. JSTOR 2285891. MR 0258201.
23. ^ ^(a) ^(b) Krugman, Paul R.; Obstfeld, M.; Melitz, Marc J. (2012). International Economics: Theory and Policy (9th global ed.). Harlow: Pearson. ISBN 9780273754091.
24. ^ Laidler, David E. W. (1993). The Demand for Money: Theories, Evidence, and Problems (4th ed.). New York: Harper Collins. ISBN 978-0065010985.
25. ^ ^(a) ^(b) Ehrenberg; Smith (2008). Modern Labor Economics (10th international ed.). London: Addison-Wesley. ISBN 9780321538963.
26. ^ EEMP webpage Archived 2011-06-11 at the Wayback Machine
27. ^ "Linear Regression (Machine Learning)" (PDF). University of Pittsburgh.
28. ^ Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge: Harvard. ISBN 0-674-40340-1.

Sources

- Cohen, J.; Cohen, P.; West, S.G.; Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
- Charles Darwin. The Variation of Animals and Plants under Domestication. (1868) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
- Draper, N.R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 978-0-471-17082-2.
- Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246–263 (1886).
- Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). Econometric Models and Economic Forecasts, ch. 1 (Intro, incl. appendices on Σ operators & derivation of parameter est.) & Appendix 4.3 (mult. regression in matrix form).

Further reading

- Pedhazur, Elazar J. (1982). Multiple Regression in Behavioral Research: Explanation and Prediction (2nd ed.). New York: Holt, Rinehart and Winston. ISBN 978-0-03-041760-3.
- Mathieu Rouaud, 2013: Probability, Statistics and Estimation. Chapter 2: Linear Regression, Linear Regression with Error Bars and Nonlinear Regression.
- National Physical Laboratory (1961). "Chapter 1: Linear Equations and Matrices: Direct Methods". Modern Computing Methods. Notes on Applied Science. Vol. 16 (2nd ed.). Her Majesty's Stationery Office.
External links

- Wikiversity has learning resources about Linear regression
- The Wikibook R Programming has a page on the topic of: Linear Models
- Wikimedia Commons has media related to Linear regression.
- Least-Squares Regression, PhET Interactive simulations, University of Colorado at Boulder
- DIY Linear Fit
Statistical model for a binary dependent variable

"Logit model" redirects here. Not to be confused with Logit function.

[Figure: a logistic regression curve fitted to data. The curve shows the probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See § Example for worked details.]

In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression^([1]) (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear combination).
Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;^([2]) the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see § Applications), and the logistic model has been the most commonly used model for binary regression since about 1970.^([3]) Binary variables can be generalized to categorical variables when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to multinomial logistic regression. If the multiple categories are ordered, one can use the ordinal logistic regression (for example the proportional odds ordinal logistic model^([4])). See § Extensions for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a binary classifier.

Analogous linear models for binary variables with a different sigmoid function instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the probit model; see § Alternatives. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the logistic function is the natural parameter for the Bernoulli distribution, and in this sense is the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions of the data being modeled; see § Maximum entropy.

The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). This does not have a closed-form expression, unlike linear least squares; see § Model fitting. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares (OLS) plays for scalar responses: it is a simple, well-analyzed baseline model; see § Comparison with linear regression for discussion.
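To make the last two ideas concrete, here is a minimal, self-contained Python sketch (an illustration, not part of the article): the logistic function converts a linear combination of inputs, the log-odds, into a probability, and an optional cutoff turns that probability into a class label. The coefficient values are placeholders.

    import math

    def probability(log_odds: float) -> float:
        # The logistic function maps log-odds (any real number) to a probability in (0, 1).
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(x: float, b0: float, b1: float, cutoff: float = 0.5) -> int:
        # The log-odds are a linear combination of the inputs; thresholding the
        # modeled probability at a chosen cutoff yields a binary classifier.
        return 1 if probability(b0 + b1 * x) > cutoff else 0

    b0, b1 = -4.1, 1.5   # placeholder coefficients (they happen to match the worked example below)
    print(classify(2.0, b0, b1), classify(4.0, b0, b1))   # -> 0 1

The cutoff of 0.5 is merely a common default; applications with asymmetric costs of misclassification often choose a different value.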
The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson,^([5]) beginning in Berkson (1944), where he coined "logit"; see § History.

Applications

Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.^([6]) Many other medical scales used to assess severity of a patient have been developed using logistic regression.^([7][8][9][10]) Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).^([11][12])
Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.^([13]) The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.^([14][15]) It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.^([16]) In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.

Example

Problem

As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:

    A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem were changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

    Hours (x_(k))  0.50  0.75  1.00  1.25  1.50  1.75  1.75  2.00  2.25  2.50  2.75  3.00  3.25  3.50  4.00  4.25  4.50  4.75  5.00  5.50
    Pass  (y_(k))     0     0     0     0     0     0     1     0     1     0     1     0     1     0     1     1     1     1     1     1

We wish to fit a logistic function to the data consisting of the hours studied (x_(k)) and the outcome of the test (y_(k) = 1 for pass, 0 for fail). The data points are indexed by the subscript k, which runs from k = 1 to k = K = 20. The x variable is called the "explanatory variable", and the y variable is called the "categorical variable" consisting of two categories, "pass" or "fail", corresponding to the categorical values 1 and 0 respectively.
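A quick sanity check on the table (a Python sketch, not from the article; the split point 2.7 anticipates the curve midpoint μ estimated later in this example) already shows the pattern the model will capture: students below roughly 2.7 hours mostly fail, students above mostly pass.

    # Hours studied x_k and pass/fail outcome y_k for the K = 20 students.
    hours  = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
    passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
    assert len(hours) == len(passed) == 20

    low  = [y for x, y in zip(hours, passed) if x < 2.7]    # students studying < 2.7 h
    high = [y for x, y in zip(hours, passed) if x >= 2.7]   # students studying >= 2.7 h
    print(sum(low) / len(low), sum(high) / len(high))       # pass rates: 0.2 vs 0.8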
Model

[Figure: graph of a logistic regression curve fitted to the (x_(m), y_(m)) data. The curve shows the probability of passing an exam versus hours studying.]

The logistic function is of the form:

    $p(x) = \frac{1}{1 + e^{-(x - \mu)/s}}$

where μ is a location parameter (the midpoint of the curve, where p(μ) = 1/2) and s is a scale parameter. This expression may be rewritten as:

    $p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

where β₀ = −μ/s and is known as the intercept (it is the vertical intercept or y-intercept of the line y = β₀ + β₁x), and β₁ = 1/s (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, μ = −β₀/β₁ and s = 1/β₁.

Fit

The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given x_(k) and y_(k), write p_(k) = p(x_(k)). The p_(k) are the probabilities that the corresponding y_(k) will be unity and 1 − p_(k) are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of β₀ and β₁ which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (y_(k)), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when that function is minimized.

The log loss for the k-th point is:

    $\begin{cases} -\ln p_k & \text{if } y_k = 1, \\ -\ln(1 - p_k) & \text{if } y_k = 0. \end{cases}$

The log loss can be interpreted as the "surprisal" of the actual outcome y_(k) relative to the prediction p_(k), and is a measure of information content. Note that log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when p_(k) = 1 and y_(k) = 1, or p_(k) = 0 and y_(k) = 0), and approaches infinity as the prediction gets worse (i.e., when y_(k) = 1 and p_(k) → 0 or y_(k) = 0 and p_(k) → 1), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Note that unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since y_(k) is either 0 or 1 but 0 < p_(k) < 1.
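A small sketch (author's illustration, not from the article) confirming that the (μ, s) and (β₀, β₁) parameterizations of the logistic function agree, and implementing the per-point log loss just defined:

    import math

    def p_location_scale(x, mu, s):
        # Logistic function in location-scale form.
        return 1.0 / (1.0 + math.exp(-(x - mu) / s))

    def p_intercept_slope(x, b0, b1):
        # Same function in intercept-slope form.
        return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

    def log_loss(p_k, y_k):
        # Surprisal of the actual outcome y_k given predicted probability p_k.
        return -math.log(p_k) if y_k == 1 else -math.log(1.0 - p_k)

    mu, s = 2.7, 0.67          # example values; then beta0 = -mu/s and beta1 = 1/s
    b0, b1 = -mu / s, 1.0 / s
    assert abs(p_location_scale(3.0, mu, s) - p_intercept_slope(3.0, b0, b1)) < 1e-12
    print(log_loss(p_intercept_slope(3.0, b0, b1), 1))   # modest loss: the model leans toward "pass" at x = 3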
These can be combined into a single expression:

    $-y_k \ln p_k - (1 - y_k)\ln(1 - p_k).$

This expression is more formally known as the cross-entropy of the predicted distribution (p_(k), (1 − p_(k))) from the actual distribution (y_(k), (1 − y_(k))), as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood −ℓ, and the best fit is obtained for those choices of β₀ and β₁ for which −ℓ is minimized.

Alternatively, instead of minimizing the loss, one can maximize its inverse, the (positive) log-likelihood:

    $\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1 - p_k) = \sum_{k=1}^{K}\left(y_k \ln(p_k) + (1 - y_k)\ln(1 - p_k)\right)$

or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

    $L = \prod_{k:y_k=1} p_k \, \prod_{k:y_k=0}(1 - p_k)$

This method is known as maximum likelihood estimation.

Parameter estimation

Since ℓ is nonlinear in β₀ and β₁, determining their optimum values will require numerical methods.
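One way to carry out that numerical maximization, sketched here under the assumption that NumPy and SciPy are available, is to minimize the negative log-likelihood −ℓ over (β₀, β₁) with a generic optimizer:

    import numpy as np
    from scipy.optimize import minimize

    # The 20 observations from the table above.
    hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                      2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
    passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

    def neg_log_likelihood(beta):
        b0, b1 = beta
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))   # p_k for every data point
        return -np.sum(passed * np.log(p) + (1 - passed) * np.log(1 - p))

    fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
    print(fit.x)   # approximately [-4.1, 1.5], the values quoted below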
Note that one method of maximizing ℓ is to require the derivatives of ℓ with respect to β₀ and β₁ to be zero:

    $0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^{K}(y_k - p_k)$

    $0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^{K}(y_k - p_k)x_k$

and the maximization procedure can be accomplished by solving the above two equations for β₀ and β₁, which, again, will generally require the use of numerical methods.

The values of β₀ and β₁ which maximize ℓ and L using the above data are found to be:

    β₀ ≈ −4.1
    β₁ ≈ 1.5

which yield values for μ and s of:

    μ = −β₀/β₁ ≈ 2.7
    s = 1/β₁ ≈ 0.67

Predictions

The β₀ and β₁ coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.

For example, for a student who studies 2 hours, entering the value x = 2 into the equation gives the estimated probability of passing the exam of 0.25:

    t = β₀ + 2β₁ ≈ −4.1 + 2 · 1.5 = −1.1
    p = 1/(1 + e^(−t)) ≈ 0.25 = probability of passing the exam

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

    t = β₀ + 4β₁ ≈ −4.1 + 4 · 1.5 = 1.9
    p = 1/(1 + e^(−t)) ≈ 0.87 = probability of passing the exam

This table shows the estimated probability of passing the exam for several values of hours studying.

    Hours of study (x)   Log-odds (t)   Odds (e^t)        Probability (p)
    1                    −2.57          0.076 ≈ 1:13.1    0.07
    2                    −1.07          0.34 ≈ 1:2.91     0.26
    μ ≈ 2.7              0              1                 1/2 = 0.50
    3                    0.44           1.55              0.61
    4                    1.94           6.96              0.87
    5                    3.45           31.4              0.97

Model evaluation

The logistic regression analysis gives the following output.

                     Coefficient   Std. Error   z-value   p-value (Wald)
    Intercept (β₀)   −4.1          1.8          −2.3      0.021
    Hours (β₁)       1.5           0.6          2.4       0.017

By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam (p = 0.017). Rather than the Wald method, the recommended method^([citation needed]) to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data give p ≈ 0.00064 (see § Deviance and likelihood ratio tests below).
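The prediction table can be reproduced with a few lines of Python (a sketch; the higher-precision coefficients below are re-fitted estimates that round to the −4.1 and 1.5 quoted above, not values stated in the article):

    import math

    b0, b1 = -4.0777, 1.5046   # assumed higher-precision MLE estimates (round to -4.1, 1.5)
    for x in [1, 2, 3, 4, 5]:
        t = b0 + b1 * x                      # log-odds
        odds = math.exp(t)                   # odds
        p = 1.0 / (1.0 + math.exp(-t))       # probability
        print(f"x={x}  t={t:+.2f}  odds={odds:6.3f}  p={p:.2f}")

Note that each additional hour of study multiplies the odds of passing by e^(β₁) ≈ e^(1.5) ≈ 4.5; this odds-ratio reading of β₁ is developed further in § The odds ratio below.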
Generalizations

This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.

Background

[Figure 1: the standard logistic function σ(t); note that σ(t) ∈ (0, 1) for all t.]

Definition of the logistic function

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input t, and outputs a value between zero and one.^([2]) For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function σ : ℝ → (0, 1) is defined as follows:

    $\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}$

A graph of the logistic function on the t-interval (−6, 6) is shown in Figure 1.

Let us assume that t is a linear function of a single explanatory variable x (the case where t is a linear combination of multiple explanatory variables is treated similarly). We can then express t as follows:

    t = β₀ + β₁x

And the general logistic function p : ℝ → (0, 1) can now be written as:

    $p(x) = \sigma(t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

In the logistic model, p(x) is interpreted as the probability of the dependent variable Y equaling a success/case rather than a failure/non-case. It's clear that the response variables Y_(i) are not identically distributed: P(Y_(i) = 1 ∣ X) differs from one data point X_(i) to another, though they are independent given the design matrix X and shared parameters β.^([11])

Definition of the inverse of the logistic function

We can now define the logit (log odds) function as the inverse g = σ⁻¹ of the standard logistic function. It is easy to see that it satisfies:

    $g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit} p(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x,$

and equivalently, after exponentiating both sides, we have the odds:

    $\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}.$

Interpretation of these terms

In the above equations, the terms are as follows:

- g is the logit function. The equation for g(p(x)) illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
- ln denotes the natural logarithm.
- p(x) is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for p(x) illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability p(x) ranges between 0 and 1.
- β₀ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
- β₁x is the regression coefficient multiplied by some value of the predictor.
- base e denotes the exponential function.

Definition of the odds

The odds of the dependent variable equaling a case (given some linear combination x of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression and the logit is easily converted back into the odds.^([2])

So we define the odds of the dependent variable equaling a case (given some linear combination x of the predictors) as follows:

    odds = e^(β₀ + β₁x).

The odds ratio

For a continuous independent variable the odds ratio can be defined as:

[Figure: a written-out odds-ratio template alongside the test-score example from the § Example section. In simple terms, with a hypothetical odds ratio of 2 to 1, one can say: "For every one-unit increase in hours studied, the odds of passing (group 1) versus failing (group 0) are (expectedly) 2 to 1" (Denis, 2019).]

    $\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\frac{p(x+1)}{1-p(x+1)}\right)}{\left(\frac{p(x)}{1-p(x)}\right)} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$

This exponential relationship provides an interpretation for β₁: the odds multiply by e^(β₁) for every 1-unit increase in x.^([17])

For a binary independent variable the odds ratio is defined as ad/bc, where a, b, c and d are cells in a 2×2 contingency table.^([18])

Multiple explanatory variables

If there are multiple explanatory variables, the above expression β₀ + β₁x can be revised to

    $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m = \beta_0 + \sum_{i=1}^{m}\beta_i x_i.$
Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters β_(j) for all j = 0, 1, 2, …, m are all estimated.

Again, the more traditional equations are:

    $\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$

and

    $p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m)}}$

where usually b = e.

Definition

The basic setup of logistic regression is as follows. We are given a dataset containing N points. Each point i consists of a set of m input variables x_(1,i) ... x_(m,i) (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Y_(i) (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.

As in linear regression, the outcome variables Y_(i) are assumed to depend on the explanatory variables x_(1,i) ... x_(m,i).

Explanatory variables

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables and discrete variables.

(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)

Outcome variables

Formally, the outcomes Y_(i) are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability p_(i) that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

    $\begin{aligned} Y_i \mid x_{1,i},\ldots,x_{m,i}\ &\sim \operatorname{Bernoulli}(p_i) \\ \mathbb{E}[Y_i \mid x_{1,i},\ldots,x_{m,i}] &= p_i \\ \Pr(Y_i = y \mid x_{1,i},\ldots,x_{m,i}) &= \begin{cases} p_i & \text{if } y = 1 \\ 1 - p_i & \text{if } y = 0 \end{cases} \\ \Pr(Y_i = y \mid x_{1,i},\ldots,x_{m,i}) &= p_i^{y}(1 - p_i)^{(1-y)} \end{aligned}$

The meanings of these four lines are:

1. The first line expresses the probability distribution of each Y_(i): conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter p_(i), the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success p_(i) is not observed, only the outcome of an individual Bernoulli trial using that probability.
2. The second line expresses the fact that the expected value of each Y_(i) is equal to the probability of success p_(i), which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success p_(i), then take the average of all the 1 and 0 outcomes, then the result would be close to p_(i). This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
4. The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that Y_(i) can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either p_(i) or 1 − p_(i), as in the previous line.

Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability p_(i) using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function f(i) for a particular data point i is written as:

    f(i) = β₀ + β₁x_(1,i) + ⋯ + β_(m)x_(m,i),

where β₀, …, β_(m) are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

- The regression coefficients β₀, β₁, ..., β_(m) are grouped into a single vector β of size m + 1.
- For each data point i, an additional explanatory pseudo-variable x_(0,i) is added, with a fixed value of 1, corresponding to the intercept coefficient β₀.
- The resulting explanatory variables x_(0,i), x_(1,i), ..., x_(m,i) are then grouped into a single vector X_(i) of size m + 1.

This makes it possible to write the linear predictor function as follows:

    f(i) = β ⋅ X_(i),

using the notation for a dot product between two vectors.

Many explanatory variables, two categories

The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables x₁, x₂, ... and any number of categorical values y = 0, 1, 2, ….

To begin with, we may consider a logistic model with M explanatory variables, x₁, x₂ ... x_(M) and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that y = 1.
This linear relationship may be extended to the case of M explanatory variables:

    $t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M$

where t is the log-odds and β_(i) are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to the Euler number e. In most applications, the base b of the logarithm is taken to be e; however, in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the β coefficients as (M + 1)-dimensional vectors:

    x = {x₀, x₁, x₂, …, x_(M)}
    β = {β₀, β₁, β₂, …, β_(M)}

with an added explanatory variable x₀ = 1. The logit may now be written as:

    $t = \sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}$

Solving for the probability p that y = 1 yields:

    $p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{1}{1 + b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}} = S_b(t),$

where S_(b) is the sigmoid function with base b. The above formula shows that once the β_(m) are fixed, we can easily compute either the log-odds that y = 1 for a given observation, or the probability that y = 1 for a given observation. The main use-case of a logistic model is to be given an observation x, and estimate the probability p(x) that y = 1. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining x_(k) as the explanatory vector of the k-th measurement, and y_(k) as the categorical outcome of that measurement, the log-likelihood may be written in a form very similar to the simple M = 1 case above:

    $\ell = \sum_{k=1}^{K} y_k \log_b(p(\boldsymbol{x}_k)) + \sum_{k=1}^{K} (1 - y_k)\log_b(1 - p(\boldsymbol{x}_k))$

As in the simple example above, finding the optimum β parameters will require numerical methods.
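This log-likelihood vectorizes naturally. A sketch assuming NumPy and base b = e, with the pseudo-variable x₀ = 1 carried as the first column of X:

    import numpy as np

    def log_likelihood(beta, X, y):
        # X: (K, M+1) array of explanatory vectors, first column all ones (x0 = 1).
        # beta: (M+1,) coefficient vector.  y: (K,) array of 0/1 outcomes.
        t = X @ beta                        # log-odds for each measurement
        p = 1.0 / (1.0 + np.exp(-t))        # probability that y = 1
        return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Tiny usage example with M = 1 and K = 3 (made-up numbers):
    X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 4.0]])
    y = np.array([0, 0, 1])
    print(log_likelihood(np.array([-4.1, 1.5]), X, y))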
One useful technique is to equate the derivatives of the log-likelihood with respect to each of the β parameters to zero, yielding a set of equations which will hold at the maximum of the log-likelihood:

    $\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^{K} y_k x_{mk} - \sum_{k=1}^{K} p(\boldsymbol{x}_k) x_{mk}$

where x_(mk) is the value of the x_(m) explanatory variable from the k-th measurement.

Consider an example with M = 2 explanatory variables, b = 10, and coefficients β₀ = −3, β₁ = 1, and β₂ = 2 which have been determined by the above method. To be concrete, the model is:

    $t = \log_{10} \frac{p}{1-p} = -3 + x_1 + 2x_2$

    $p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},$

where p is the probability of the event that y = 1. This can be interpreted as follows:

- β₀ = −3 is the y-intercept. It is the log-odds of the event that y = 1, when the predictors x₁ = x₂ = 0. By exponentiating, we can see that when x₁ = x₂ = 0 the odds of the event that y = 1 are 1-to-1000, or 10⁻³. Similarly, the probability of the event that y = 1 when x₁ = x₂ = 0 can be computed as 1/(1000 + 1) = 1/1001.
- β₁ = 1 means that increasing x₁ by 1 increases the log-odds by 1. So if x₁ increases by 1, the odds that y = 1 increase by a factor of 10¹. Note that the probability of y = 1 has also increased, but it has not increased by as much as the odds have increased.
- β₂ = 2 means that increasing x₂ by 1 increases the log-odds by 2. So if x₂ increases by 1, the odds that y = 1 increase by a factor of 10². Note how the effect of x₂ on the log-odds is twice as great as the effect of x₁, but the effect on the odds is 10 times greater. The effect on the probability of y = 1 is not as much as 10 times greater; it is only the effect on the odds that is 10 times greater.

Multinomial logistic regression: Many explanatory variables and many categories

Main article: Multinomial logistic regression

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probability distributions: the probability that the outcome was in category 1 was given by p(x) and the probability that the outcome was in category 0 was given by 1 − p(x).
The sum of both probabilities is equal to unity, as they must be.

In general, if we have M + 1 explanatory variables (including x₀) and N + 1 categories, we will need N + 1 separate probability distributions, one for each category, indexed by n, which describe the probability that the categorical outcome y for explanatory vector x will be in category y = n. It will also be required that the sum of these probabilities over all categories be equal to unity. Using the mathematically convenient base e, these probabilities are:

    $p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n \cdot \boldsymbol{x}}}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}}$ for n = 1, 2, …, N

    $p_0(\boldsymbol{x}) = 1 - \sum_{n=1}^{N} p_n(\boldsymbol{x}) = \frac{1}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}}$

Each of the probabilities except p₀(x) will have its own set of regression coefficients β_(n). It can be seen that, as required, the sum of the p_(n)(x) over all categories is unity. Note that the selection of p₀(x) to be defined in terms of the other probabilities is artificial; any of the probabilities could have been selected to be so defined. This special value of n is termed the "pivot index", and the log-odds (t_(n)) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

    $t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}$

Note also that for the simple case of N = 1, the two-category case is recovered, with p(x) = p₁(x) and p₀(x) = 1 − p₁(x).

The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by x_(k) and their categorical outcomes be denoted by y_(k), which can be equal to any integer in [0, N]. The log-likelihood is then:

    $\ell = \sum_{k=1}^{K} \sum_{n=0}^{N} \Delta(n, y_k)\,\ln(p_n(\boldsymbol{x}_k))$

where Δ(n, y_(k)) is an indicator function which is equal to unity if y_(k) = n and zero otherwise. In the two-category case above, this indicator function was defined as y_(k) when n = 1 and 1 − y_(k) when n = 0. This was convenient, but not necessary.^([19]) Again, the optimum beta coefficients may be found by maximizing the log-likelihood function, generally using numerical methods.
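A sketch of these N + 1 category probabilities with category 0 as the pivot (assuming NumPy; betas holds the coefficient vectors β₁ … β_N as rows, an illustrative layout rather than anything prescribed by the article):

    import numpy as np

    def category_probabilities(betas, x):
        # betas: (N, M+1) array, one coefficient vector per non-pivot category.
        # x: (M+1,) explanatory vector with x[0] = 1.  Returns [p_0, p_1, ..., p_N].
        scores = np.exp(betas @ x)               # e^(beta_n . x) for n = 1..N
        denom = 1.0 + scores.sum()
        return np.concatenate(([1.0], scores)) / denom

    # The probabilities sum to one by construction (up to floating-point rounding):
    p = category_probabilities(np.array([[0.2, -1.0], [0.5, 0.3]]), np.array([1.0, 2.0]))
    print(p, p.sum())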
A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:

    $\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^{K} \Delta(n, y_k) x_{mk} - \sum_{k=1}^{K} p_n(\boldsymbol{x}_k) x_{mk}$

where β_(nm) is the m-th coefficient of the β_(n) vector and x_(mk) is the m-th explanatory variable of the k-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.

Interpretations

There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.

As a generalized linear model

The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

    $\operatorname{logit}(\mathbb{E}[Y_i \mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$

Written using the more compact notation described above, this is:

    $\operatorname{logit}(\mathbb{E}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol{\beta} \cdot \mathbf{X}_i$

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above.^([clarification needed]) It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over (−∞, +∞), thereby matching the potential range of the linear prediction function on the right side of the equation.
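The link function and its inverse can be sketched directly (an illustration of the logit/logistic pair, not library code):

    import math

    def logit(p: float) -> float:
        # Link function: maps a probability in (0, 1) onto the whole real line.
        return math.log(p / (1.0 - p))

    def logit_inv(t: float) -> float:
        # Inverse link (the logistic function): maps the real line back into (0, 1).
        return 1.0 / (1.0 + math.exp(-t))

    assert abs(logit_inv(logit(0.25)) - 0.25) < 1e-12   # round trip recovers the probability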
Note that both the probabilities p_(i) and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.^([20])

The interpretation of the β_(j) parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, e^(β) is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

    $\mathbb{E}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{X}_i}}$

The formula can also be written as a probability distribution (specifically, using a probability mass function):

    $\Pr(Y_i = y \mid \mathbf{X}_i) = p_i^{y}(1 - p_i)^{1-y} = \left(\frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{y}\left(1 - \frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i \cdot y}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}$

As a latent-variable model

The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

Imagine that, for each trial i, there is a continuous latent variable Y_(i)^(*) (i.e. an unobserved random variable) that is distributed as follows:

    Y_(i)^(*) = β ⋅ X_(i) + ε_(i)

where

    ε_(i) ∼ Logistic(0, 1)

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then Y_(i) can be viewed as an indicator for whether this latent variable is positive:

    $Y_i = \begin{cases} 1 & \text{if } Y_i^{*} > 0, \text{ i.e. } -\varepsilon_i < \boldsymbol{\beta} \cdot \mathbf{X}_i, \\ 0 & \text{otherwise.} \end{cases}$

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Y_(i)^(*) regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Y_(i)^(*) will be smaller by a factor of s than in the former case, for all sets of explanatory variables, but critically, it will always remain on the same side of 0, and hence lead to the same Y_(i) choice.

(Note that this predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables.
This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

$\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)$

Then:

$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon_i > 0) \\ &= \Pr(\varepsilon_i > -\boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon_i < \boldsymbol{\beta} \cdot \mathbf{X}_i) && \text{(because the logistic distribution is symmetric)} \\ &= \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= p_i && \text{(see above)} \end{aligned}$

This formulation, which is standard in discrete choice models, makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
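The equivalence can also be checked by simulation. A small sketch (hypothetical linear-predictor value; NumPy assumed) draws standard logistic errors and confirms that the fraction of positive latent variables matches the inverse-logit probability:

import numpy as np

rng = np.random.default_rng(0)
beta_dot_x = 0.7                     # hypothetical value of the linear predictor
eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)  # standard logistic errors
frac = (beta_dot_x + eps > 0).mean() # fraction of trials with Y* > 0
p = 1.0 / (1.0 + np.exp(-beta_dot_x))
print(frac, p)                       # both close to ~0.668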
Two-way latent-variable model[edit]

Yet another formulation uses two separate latent variables:

$\begin{aligned} Y_i^{0\ast} &= \boldsymbol{\beta}_0 \cdot \mathbf{X}_i + \varepsilon_0 \\ Y_i^{1\ast} &= \boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1 \end{aligned}$

where

$\varepsilon_0 \sim \operatorname{EV}_1(0,1), \qquad \varepsilon_1 \sim \operatorname{EV}_1(0,1)$

where EV₁(0,1) is a standard type-1 extreme value distribution; i.e.

$\Pr(\varepsilon_0 = x) = \Pr(\varepsilon_1 = x) = e^{-x}e^{-e^{-x}}$

Then

$Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 & \text{otherwise.} \end{cases}$

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

$\boldsymbol{\beta} = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_0$

$\varepsilon = \varepsilon_1 - \varepsilon_0$

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values; this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e.

$\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0,1).$
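This last fact is easy to verify numerically. The following sketch (NumPy assumed; sample size chosen arbitrarily) draws two independent standard type-1 extreme value (Gumbel) samples and compares the empirical CDF of their difference with the standard logistic CDF:

import numpy as np

rng = np.random.default_rng(0)
eps0 = rng.gumbel(size=1_000_000)   # standard type-1 extreme value draws
eps1 = rng.gumbel(size=1_000_000)
diff = eps1 - eps0
# Compare the empirical CDF of the difference with the standard logistic CDF
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, (diff < x).mean(), 1.0 / (1.0 + np.exp(-x)))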
We can demonstrate the equivalence as follows:

$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1\ast} > Y_i^{0\ast} \mid \mathbf{X}_i) \\ &= \Pr(Y_i^{1\ast} - Y_i^{0\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol{\beta}_0 \cdot \mathbf{X}_i + \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol{\beta}_1 \cdot \mathbf{X}_i - \boldsymbol{\beta}_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol{\beta}_1 - \boldsymbol{\beta}_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol{\beta}_1 - \boldsymbol{\beta}_0) \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \varepsilon \text{ as above)} \\ &= \Pr(\boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \boldsymbol{\beta} \text{ as above)} \\ &= \Pr(\varepsilon > -\boldsymbol{\beta} \cdot \mathbf{X}_i) && \text{(now, same as above model)} \\ &= \Pr(\varepsilon < \boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= p_i \end{aligned}$

Example[edit]

As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. Then, in accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices.
We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility; or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since he or she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money.

These intuitions can be expressed as follows (estimated strength of regression coefficient for different outcomes, i.e. party choices, and different values of explanatory variables):

  ---------------  --------------  -------------  --------------
                   Center-right    Center-left    Secessionist
  High-income      strong +        strong −       strong −
  Middle-income    moderate +      weak +         none
  Low-income       none            strong +       none
  ---------------  --------------  -------------  --------------

This clearly shows that

1. Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.

2. Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable.
Either it needs to be directly split up into ranges, or higher powers of income need to be added so that polynomial regression on income is effectively performed.

As a "log-linear" model[edit]

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities p_i as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

$\begin{aligned} \ln \Pr(Y_i = 0) &= \boldsymbol{\beta}_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i = 1) &= \boldsymbol{\beta}_1 \cdot \mathbf{X}_i - \ln Z \end{aligned}$

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations take a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

$\begin{aligned} \Pr(Y_i = 0) &= \frac{1}{Z} e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} \\ \Pr(Y_i = 1) &= \frac{1}{Z} e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i} \end{aligned}$

In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Y_i is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized".
That is:

$Z = e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}$

and the resulting equations are

$\begin{aligned} \Pr(Y_i = 0) &= \frac{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}} \\ \Pr(Y_i = 1) &= \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}. \end{aligned}$

Or generally:

$\Pr(Y_i = c) = \frac{e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol{\beta}_h \cdot \mathbf{X}_i}}$

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit. Note that this general formulation is exactly the softmax function, as in

$\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol{\beta}_0 \cdot \mathbf{X}_i, \boldsymbol{\beta}_1 \cdot \mathbf{X}_i, \dots).$
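In code, this general formulation is just the softmax over the per-class linear predictors. A minimal sketch (hypothetical coefficient vectors; NumPy assumed) computes Pr(Y = c) by exponentiating the scores and dividing by Z; subtracting the maximum score first is a standard numerical-stability trick that, by the shift invariance discussed next, does not change the result:

import numpy as np

def class_probs(betas, x):
    """Pr(Y = c) = exp(beta_c . x) / sum_h exp(beta_h . x) -- the softmax form."""
    scores = np.array([np.dot(b, x) for b in betas])  # unnormalized log-probabilities
    unnorm = np.exp(scores - scores.max())            # subtract max for numerical stability
    return unnorm / unnorm.sum()                      # divide by Z to normalize

x = np.array([1.0, 2.0])                              # hypothetical feature vector
betas = [np.array([0.0, 0.0]), np.array([-1.0, 0.8])] # beta_0, beta_1 (made up)
print(class_probs(betas, x))                          # entries sum to 1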
In order to prove that this is equivalent to the previous model, note that the above model is overspecified, in that $\Pr(Y_i = 0)$ and $\Pr(Y_i = 1)$ cannot be independently specified: rather $\Pr(Y_i = 0) + \Pr(Y_i = 1) = 1$, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β₀ and β₁ will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

$\begin{aligned} \Pr(Y_i = 1) &= \frac{e^{(\boldsymbol{\beta}_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol{\beta}_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol{\beta}_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\ &= \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\ &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i})} \\ &= \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}. \end{aligned}$

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set $\boldsymbol{\beta}_0 = \mathbf{0}$. Then,

$e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1$

and so

$\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}} = p_i$

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where $\boldsymbol{\beta} = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_0$ will produce equivalent results.)

Note that most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.
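The nonidentifiability argument, and the effect of pinning β₀ = 0, can be illustrated directly. In this sketch (made-up coefficient vectors; NumPy assumed), adding an arbitrary constant vector C to both coefficient vectors leaves the class probabilities unchanged, and setting β₀ = 0 recovers the familiar two-class logistic form:

import numpy as np

def class_probs(betas, x):
    """Softmax over per-class linear predictors, as in the previous sketch."""
    scores = np.array([np.dot(b, x) for b in betas])
    unnorm = np.exp(scores - scores.max())
    return unnorm / unnorm.sum()

x = np.array([1.0, 2.0])
b0, b1 = np.array([-1.0, 0.3]), np.array([0.5, -0.2])
c = np.array([10.0, -4.0])                      # arbitrary constant vector C
print(class_probs([b0, b1], x))                 # identical ...
print(class_probs([b0 + c, b1 + c], x))         # ... to this
# Pinning beta_0 = 0 recovers the usual two-class logistic form:
p1 = 1.0 / (1.0 + np.exp(-np.dot(b1 - b0, x)))
print(p1)                                       # equals the second entry above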
As a single-layer perceptron[edit]

The model has an equivalent formulation

$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}.$

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of p_i with respect to X = (x₁, ..., x_k) is computed from the general form:

$y = \frac{1}{1 + e^{-f(X)}}$

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

$\frac{dy}{dX} = y(1 - y)\frac{df}{dX}.$

In terms of binomial data[edit]

A closely related model assumes that each i is associated not with a single Bernoulli trial but with n_i independent identically distributed trials, where the observation Y_i is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:

$Y_i \sim \operatorname{Bin}(n_i, p_i), \text{ for } i = 1, \dots, n$

An example of this distribution is the fraction of seeds (p_i) that germinate after n_i are planted.

In terms of expected values, this model is expressed as follows:

$p_i = \operatorname{E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i\right],$

so that

$\operatorname{logit}\left(\operatorname{E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i\right]\right) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol{\beta} \cdot \mathbf{X}_i,$

Or equivalently:

$\Pr(Y_i = y \mid \mathbf{X}_i) = \binom{n_i}{y} p_i^{\,y}(1 - p_i)^{n_i - y} = \binom{n_i}{y}\left(\frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{y}\left(1 - \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{n_i - y}.$

This model can be fit using the same sorts of methods as the above more basic model.
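As an illustration of the binomial formulation, the following sketch (hypothetical coefficients and a made-up germination scenario; NumPy assumed) evaluates Pr(Y_i = y | X_i) and checks that the probabilities over y = 0, ..., n_i sum to 1:

import numpy as np
from math import comb

def binom_logistic_pmf(y, n, beta, x):
    """Pr(Y = y | x) for Y ~ Bin(n, p) with p = 1 / (1 + exp(-beta . x))."""
    p = 1.0 / (1.0 + np.exp(-np.dot(beta, x)))
    return comb(n, y) * p**y * (1.0 - p)**(n - y)

# Hypothetical germination example: n_i = 20 seeds planted for covariate x
beta = np.array([0.2, -0.1])
x = np.array([1.0, 4.0])       # leading 1 is the intercept term
print(sum(binom_logistic_pmf(y, 20, beta, x) for y in range(21)))  # sums to 1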
Model fitting[edit]

Maximum likelihood estimation (MLE)[edit]

The regression coefficients are usually estimated using maximum likelihood estimation.^([21][22]) Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead; for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.^([21])

In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.

- Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to non-convergence. Regularized logistic regression is specifically intended to be used in this situation.
- Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.^([21]) To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic^([21]) used to assess whether multicollinearity is unacceptably high.
- Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is an undefined value, so that the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.^([21])
- Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion: all cases are accurately classified and the likelihood is maximized with infinite coefficients. In such instances, one should re-examine the data, as there may be some kind of error.^([2][further explanation needed])
- One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions of a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).^([23])

Iteratively reweighted least squares (IRLS)[edit]

Binary logistic regression ($y = 0$ or $y = 1$) can, for example, be calculated using iteratively reweighted least squares (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli distributed process using Newton's method. If the problem is written in vector-matrix form, with parameters $\mathbf{w}^T = [\beta_0, \beta_1, \beta_2, \ldots]$, explanatory variables $\mathbf{x}(i) = [1, x_1(i), x_2(i), \ldots]^T$ and expected value of the Bernoulli distribution $\mu(i) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}(i)}}$, the parameters $\mathbf{w}$ can be found using the following iterative algorithm:

$\mathbf{w}_{k+1} = \left(\mathbf{X}^T \mathbf{S}_k \mathbf{X}\right)^{-1} \mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \boldsymbol{\mu}_k\right)$

where $\mathbf{S} = \operatorname{diag}(\mu(i)(1 - \mu(i)))$ is a diagonal weighting matrix, $\boldsymbol{\mu} = [\mu(1), \mu(2), \ldots]$ the vector of expected values,

$\mathbf{X} = \begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots \\ 1 & x_1(2) & x_2(2) & \ldots \\ \vdots & \vdots & \vdots & \end{bmatrix}$

the regressor matrix, and $\mathbf{y} = [y(1), y(2), \ldots]^T$ the vector of response variables. More details can be found in the literature.^([24])
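The iterative algorithm above translates almost line for line into code. The following is a minimal sketch (not a production implementation; synthetic data with made-up coefficients, NumPy assumed, and a fixed iteration count in place of a convergence test):

import numpy as np

def irls(X, y, n_iter=25):
    """Fit binary logistic regression by iteratively reweighted least squares.

    X is the regressor matrix with a leading column of ones; y is 0/1.
    Implements w_{k+1} = (X^T S_k X)^{-1} X^T (S_k X w_k + y - mu_k).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ w))      # expected values mu(i)
        S = np.diag(mu * (1.0 - mu))           # diagonal weighting matrix
        w = np.linalg.solve(X.T @ S @ X, X.T @ (S @ X @ w + y - mu))
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_w = np.array([-0.5, 1.2])                 # made-up coefficients
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
print(irls(X, y))                              # roughly recovers true_w

In practice one would iterate until the change in w (or in the log-likelihood) falls below a tolerance, and solve the linear system rather than forming an explicit matrix inverse, as is done here with numpy.linalg.solve.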
Bayesian[edit]

[Figure: Comparison of the logistic function with a scaled inverse probit function (i.e. the CDF of the normal distribution), comparing $\sigma(x)$ vs. $\Phi(\sqrt{\pi/8}\,x)$, which makes the slopes the same at the origin. This shows the heavier tails of the logistic distribution.]

In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, for example in the form of Gaussian distributions. There is no conjugate prior of the likelihood function in logistic regression. When Bayesian inference was performed analytically, this made the posterior distribution difficult to calculate except in very low dimensions. Now, though, automatic software such as OpenBUGS, JAGS, PyMC3, Stan or Turing.jl allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as variational Bayesian methods and expectation propagation.

"Rule of ten"[edit]

Main article: One in ten rule

A widely used rule of thumb, the "one in ten rule", states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where event denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use $k$ explanatory variables for an event (e.g. myocardial infarction) expected to occur in a proportion $p$ of participants in the study will require a total of $10k/p$ participants. For example, a study with five explanatory variables and an expected event proportion of 0.1 would require about 10 × 5 / 0.1 = 500 participants.
However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.^([25]) According to some authors^([26]) the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".^([27])

Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.^([28]) Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.^([29])

Error and significance of fit[edit]

Deviance and likelihood ratio test – a simple case[edit]

In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "overfitting" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting.

In short, for logistic regression, a statistic known as the deviance is defined, which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is chi-squared distributed, which allows a chi-squared test to be implemented in order to determine the significance of the explanatory variables.

Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of K data points (x_k, y_k) are fitted to a proposed model function of the form $y = b_0 + b_1 x$. The fit is obtained by choosing the b parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:

$\epsilon^2 = \sum_{k=1}^{K} (b_0 + b_1 x_k - y_k)^2.$

The minimum value which constitutes the fit will be denoted by $\hat{\epsilon}^2$.

The idea of a null model may be introduced, in which it is assumed that the x variable is of no use in predicting the y_k outcomes: the data points are fitted to a null model function of the form $y = b_0$ with a squared error term:

$\epsilon^2 = \sum_{k=1}^{K} (b_0 - y_k)^2.$

The fitting process consists of choosing a value of b₀ which minimizes $\epsilon^2$ of the fit to the null model, denoted by $\epsilon_\varphi^2$, where the $\varphi$ subscript denotes the null model.
It is seen that the null model is optimized by $b_0 = \bar{y}$, where $\bar{y}$ is the mean of the y_k values, and the optimized $\epsilon_\varphi^2$ is:

$\hat{\epsilon}_\varphi^2 = \sum_{k=1}^{K} (\bar{y} - y_k)^2$

which is proportional to the square of the (uncorrected) sample standard deviation of the y_k data points.

We can imagine a case where the y_k data points are randomly assigned to the various x_k, and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the y_k outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a chi-squared distribution, with degrees of freedom equal to those of the proposed model minus those of the null model, which, in this case, will be 2 − 1 = 1. Using the chi-squared test, we may then estimate how many of these permuted sets of y_k will yield a minimum error less than or equal to the minimum error using the original y_k, and so we can estimate how significant an improvement is given by the inclusion of the x variable in the proposed model.

For logistic regression, the measure of goodness-of-fit is the likelihood function L, or its logarithm, the log-likelihood ℓ. The likelihood function L is analogous to the $\epsilon^2$ in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by $\hat{\ell}$.

In the case of simple binary logistic regression, the set of K data points are fitted in a probabilistic sense to a function of the form:

$p(x) = \frac{1}{1 + e^{-t}}$

where p(x) is the probability that $y = 1$. The log-odds are given by:

$t = \beta_0 + \beta_1 x$

and the log-likelihood is:

$\ell = \sum_{k=1}^{K} \left( y_k \ln(p(x_k)) + (1 - y_k) \ln(1 - p(x_k)) \right)$

For the null model, the probability that $y = 1$ is given by:

$p_\varphi(x) = \frac{1}{1 + e^{-t_\varphi}}$

The log-odds for the null model are given by:

$t_\varphi = \beta_0$

and the log-likelihood is:

$\ell_\varphi = \sum_{k=1}^{K} \left( y_k \ln(p_\varphi) + (1 - y_k) \ln(1 - p_\varphi) \right)$

Since we have $p_\varphi = \bar{y}$ at the maximum of L, the maximum log-likelihood for the null model is

$\hat{\ell}_\varphi = K\left(\bar{y} \ln(\bar{y}) + (1 - \bar{y}) \ln(1 - \bar{y})\right)$

The optimum $\beta_0$ is:

$\beta_0 = \ln\left(\frac{\bar{y}}{1 - \bar{y}}\right)$

where $\bar{y}$ is again the mean of the y_k values.
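These closed-form null-model results are easy to confirm numerically. A small sketch (made-up 0/1 outcomes; NumPy assumed) computes β₀ and the maximum log-likelihood from the formulas above and checks them against a direct evaluation of the log-likelihood:

import numpy as np

y = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=float)  # made-up 0/1 outcomes
ybar = y.mean()

# Closed-form null-model results derived above:
beta0 = np.log(ybar / (1.0 - ybar))                 # optimum intercept
ll_null = len(y) * (ybar * np.log(ybar) + (1.0 - ybar) * np.log(1.0 - ybar))

# Direct check: the log-likelihood at p = ybar matches ll_null
p = 1.0 / (1.0 + np.exp(-beta0))                    # recovers ybar
ll_direct = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
print(beta0, ll_null, ll_direct)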
Again, we can conceptually consider the fit of the proposed model to every permutation of the y_k, and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:

$\hat{\ell} \geq \hat{\ell}_\varphi$

Also, as an analog to the error of the linear regression case, we may define the deviance of a logistic regression fit as:

$D = \ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell} - \hat{\ell}_\varphi)$

which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (K) increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so have an estimate of how significantly the model is improved by including the x_k data points in the proposed model.

For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is $\hat{\ell}_\varphi = -13.8629\ldots$ The maximum value of the log-likelihood for the simple model is $\hat{\ell} = -8.02988\ldots$ so that the deviance is

$D = 2(\hat{\ell} - \hat{\ell}_\varphi) = 11.6661\ldots$

Using the chi-squared test of significance, the integral of the chi-squared distribution with one degree of freedom from 11.6661... to infinity is equal to 0.00063649...

This effectively means that about 6 out of 10,000 fits to random y_k can be expected to have a better fit (smaller deviance) than the given y_k, and so we can conclude that the inclusion of the x variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the null hypothesis with 1 − 0.00064 ≈ 99.94% confidence.
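The arithmetic of this example can be reproduced in a few lines (SciPy assumed; the two log-likelihood values are taken from the text above):

from scipy.stats import chi2

ll_null = -13.8629    # maximized log-likelihood of the null model (from the text)
ll_fit = -8.02988     # maximized log-likelihood of the proposed model
D = 2 * (ll_fit - ll_null)              # deviance, here 11.666...
p_value = chi2.sf(D, df=1)              # upper tail of chi-squared with 1 d.o.f.
print(D, p_value)                       # ~11.666, ~0.000636

Here chi2.sf computes the upper-tail integral of the chi-squared distribution, which for one degree of freedom gives the quoted 0.00063649...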
Goodness of fit summary[edit]

Goodness of fit in linear regression models is generally measured using R². Since this has no direct analog in logistic regression, various methods^([30]: ch.21) including the following can be used instead.

Deviance and likelihood ratio tests[edit]

In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations; variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of a sum of squares calculation.^([31]) Deviance is analogous to the sum of squares calculations in linear regression^([2]) and is a measure of the lack of fit to the data in a logistic regression model.^([31]) When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.^([2]) This computation gives the likelihood-ratio test:^([2])

$D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}.$

In the above equation, D represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. D can be shown to follow an approximate chi-squared distribution.^([2]) Smaller values indicate better fit, as the fitted model deviates less from the saturated model. When assessed upon a chi-squared distribution, nonsignificant chi-squared values indicate very little unexplained variance and thus good model fit. Conversely, a significant chi-squared value indicates that a significant amount of the variance is unexplained.

When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm.

Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.^([31]) In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit.
Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a $\chi_{s-p}^2$ chi-squared distribution, with degrees of freedom^([2]) equal to the difference in the number of parameters estimated.

Let

$\begin{aligned} D_{\text{null}} &= -2 \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \\ D_{\text{fitted}} &= -2 \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}. \end{aligned}$

Then the difference of both is:

$\begin{aligned} D_{\text{null}} - D_{\text{fitted}} &= -2 \left( \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} - \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \right) \\ &= -2 \ln \frac{\left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \right)}{\left( \dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \right)} \\ &= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of fitted model}}. \end{aligned}$

If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improve the model's fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.^([31])

Pseudo-R-squared[edit]

Main article: Pseudo-R-squared

In linear regression the squared multiple correlation, R², is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.^([31]) In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations.^([31][32]) Four of the most commonly used indices and one less commonly used one are examined on this page:

- Likelihood ratio R²_L
- Cox and Snell R²_CS
- Nagelkerke R²_N
- McFadden R²_McF
- Tjur R²_T

Hosmer–Lemeshow test[edit]

The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a $\chi^2$ distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relatively low power.^([33])
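For concreteness, here is a rough sketch of the Hosmer–Lemeshow computation (not a reference implementation; NumPy and SciPy assumed, simulated probabilities, and the conventional g − 2 degrees of freedom), which bins observations by predicted probability and compares observed with expected counts:

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: bin observations into g groups by predicted
    probability, then compare observed and expected counts in each group."""
    order = np.argsort(p)
    groups = np.array_split(order, g)
    H = 0.0
    for idx in groups:
        obs1, exp1 = y[idx].sum(), p[idx].sum()
        obs0, exp0 = len(idx) - obs1, len(idx) - exp1
        H += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return H, chi2.sf(H, df=g - 2)   # conventional g - 2 degrees of freedom

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=1000)        # hypothetical fitted probabilities
y = (rng.random(1000) < p).astype(float)      # outcomes consistent with the model
print(hosmer_lemeshow(y, p))                  # H small, p-value large: no misfit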
Coefficient significance[edit]

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.^([31]) In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient, the odds ratio (see definition). In linear regression, the significance of a regression coefficient is assessed by computing a t test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.

Likelihood ratio test[edit]

The likelihood-ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model.^([2][21][31]) In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-squared distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (c.f. chi-squared using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case.^([citation needed]) To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous one to determine the contribution of each predictor.^([31]) There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures.^([weasel words]) The fear is that they may not preserve nominal statistical properties and may become misleading.^([34])

Wald statistic[edit]

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-squared distribution.^([21])

$W_j = \frac{\beta_j^2}{SE_{\beta_j}^2}$

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger, increasing the probability of Type-II error. The Wald statistic also tends to be biased when data are sparse.^([31])
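Computing the Wald statistic and its p-value is straightforward once the coefficient and its standard error are available. In this sketch the two numbers are made up, and SciPy is assumed:

from scipy.stats import chi2

beta_j = 0.85    # hypothetical estimated coefficient
se_j = 0.32      # hypothetical standard error of that coefficient

W_j = beta_j**2 / se_j**2          # Wald statistic, approximately chi-squared(1)
p_value = chi2.sf(W_j, df=1)
print(W_j, p_value)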
Case-control sampling[edit]

Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also called retrospective sampling, or equivalently, unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.^([35])

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, then, if the model is correct in the general population, the $\beta_j$ parameters are all correct except for $\beta_0$. We can correct $\beta_0$ if we know the true prevalence as follows:^([35])

$\hat{\beta}_0^{\ast} = \hat{\beta}_0 + \log \frac{\pi}{1 - \pi} - \log \frac{\tilde{\pi}}{1 - \tilde{\pi}}$

where $\pi$ is the true prevalence and $\tilde{\pi}$ is the prevalence in the sample.
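The intercept correction is a direct transcription of the formula. A minimal sketch (made-up intercept and prevalences; NumPy assumed):

import numpy as np

def correct_intercept(beta0_hat, true_prev, sample_prev):
    """Adjust the case-control intercept using the true prevalence pi and the
    sample prevalence pi-tilde (all other coefficients are left unchanged)."""
    logit = lambda q: np.log(q / (1.0 - q))
    return beta0_hat + logit(true_prev) - logit(sample_prev)

# Hypothetical study: disease prevalence 1/10,000, but half of the sample are cases
print(correct_intercept(beta0_hat=-0.3, true_prev=1e-4, sample_prev=0.5))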
Discussion[edit]

Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting dependent variables that take membership in one of a limited number of categories (treating the dependent variable in the binomial case as the outcome of a Bernoulli trial) rather than a continuous outcome. Given this difference, the assumptions of linear regression are violated. In particular, the residuals cannot be normally distributed. In addition, linear regression may make nonsensical predictions for a binary dependent variable. What is needed is a way to convert a binary variable into a continuous one that can take on any real value (negative or positive). To do that, binomial logistic regression first calculates the odds of the event happening for different levels of each independent variable, and then takes its logarithm to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the logit of the probability; the logit is defined as follows:

$\operatorname{logit} p = \ln \frac{p}{1 - p} \quad \text{for } 0 < p < 1.$

Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale.^([2]) The logit function is the link function in this kind of generalized linear model, i.e.

$\operatorname{logit} \operatorname{E}(Y) = \beta_0 + \beta_1 x$

Y is the Bernoulli-distributed response variable and x is the predictor variable; the β values are the linear parameters.

The logit of the probability of success is then fitted to the predictors. The predicted value of the logit is converted back into predicted odds, via the inverse of the natural logarithm, the exponential function. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.

Maximum entropy[edit]

Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. probit regression, Poisson regression, etc.), the logistic regression solution is unique in that it is a maximum entropy solution.^([36]) This is a case of a general property: an exponential family of distributions maximizes entropy, given an expected value. In the case of the logistic model, the logit is the natural parameter of the Bernoulli distribution (the model is in "canonical form", with the logit as the canonical link function), while other sigmoid functions correspond to non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See Exponential family § Maximum entropy derivation for details.

Proof[edit]

In order to show this, we use the method of Lagrange multipliers. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression.^([36])

As in the above section on multinomial logistic regression, we will consider $M + 1$ explanatory variables denoted $x_m$, which include $x_0 = 1$. There will be a total of K data points, indexed by $k = \{1, 2, \dots, K\}$, and the data points are given by $x_{mk}$ and $y_k$. The $x_{mk}$ will also be represented as an $(M+1)$-dimensional vector $\boldsymbol{x}_k = \{x_{0k}, x_{1k}, \dots, x_{Mk}\}$. There will be $N + 1$ possible values of the categorical variable y, ranging from 0 to N.

Let $p_n(\boldsymbol{x})$ be the probability, given explanatory variable vector $\boldsymbol{x}$, that the outcome will be $y = n$. Define $p_{nk} = p_n(\boldsymbol{x}_k)$, which is the probability that for the k-th measurement, the categorical outcome is n.

The Lagrangian will be expressed as a function of the probabilities $p_{nk}$ and will be minimized by equating the derivatives of the Lagrangian with respect to these probabilities to zero.
An important point is that the probabilities are treated equally and the fact that they sum to unity is part of the Lagrangian formulation, rather than being assumed from the beginning.

The first contribution to the Lagrangian is the entropy:

$\mathcal{L}_{ent} = -\sum_{k=1}^{K} \sum_{n=0}^{N} p_{nk} \ln(p_{nk})$

The log-likelihood is:

$\ell = \sum_{k=1}^{K} \sum_{n=0}^{N} \Delta(n, y_k) \ln(p_{nk})$

Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:

$\frac{\partial \ell}{\partial \beta_{nm}} = \sum_{k=1}^{K} (p_{nk} x_{mk} - \Delta(n, y_k) x_{mk})$

A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities $p_{nk}$ and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of $p_{nk}$. There are then (M+1)(N+1) fitting constraints and the fitting constraint term in the Lagrangian is then:

$\mathcal{L}_{fit} = \sum_{n=0}^{N} \sum_{m=0}^{M} \lambda_{nm} \sum_{k=1}^{K} (p_{nk} x_{mk} - \Delta(n, y_k) x_{mk})$

where the $\lambda_{nm}$ are the appropriate Lagrange multipliers. There are K normalization constraints which may be written:

$\sum_{n=0}^{N} p_{nk} = 1$

so that the normalization term in the Lagrangian is:

$\mathcal{L}_{norm} = \sum_{k=1}^{K} \alpha_k \left(1 - \sum_{n=1}^{N} p_{nk}\right)$

where the $\alpha_k$ are the appropriate Lagrange multipliers.
The Lagrangian is then the sum of the above three terms:

$\mathcal{L} = \mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}$

Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:

$\frac{\partial \mathcal{L}}{\partial p_{n'k'}} = 0 = -\ln(p_{n'k'}) - 1 + \sum_{m=0}^{M} (\lambda_{n'm} x_{mk'}) - \alpha_{k'}$

Using the more condensed vector notation:

$\sum_{m=0}^{M} \lambda_{nm} x_{mk} = \boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k$

and dropping the primes on the n and k indices, and then solving for $p_{nk}$, yields:

$p_{nk} = e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k} / Z_k$

where:

$Z_k = e^{1 + \alpha_k}$

Imposing the normalization constraint, we can solve for the $Z_k$ and write the probabilities as:

$p_{nk} = \frac{e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k}}{\sum_{u=0}^{N} e^{\boldsymbol{\lambda}_u \cdot \boldsymbol{x}_k}}$

The $\boldsymbol{\lambda}_n$ are not all independent. We can add any constant $(M+1)$-dimensional vector to each of the $\boldsymbol{\lambda}_n$ without changing the value of the $p_{nk}$ probabilities, so that there are only N rather than $N + 1$ independent $\boldsymbol{\lambda}_n$. In the multinomial logistic regression section above, the $\boldsymbol{\lambda}_0$ was subtracted from each $\boldsymbol{\lambda}_n$, which set the exponential term involving $\boldsymbol{\lambda}_0$ to unity, and the beta coefficients were given by $\boldsymbol{\beta}_n = \boldsymbol{\lambda}_n - \boldsymbol{\lambda}_0$.

Other approaches[edit]

In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function.

Logistic regression is an important machine learning algorithm.
Other approaches

In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function. Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable $Y$ being 0 or 1 given experimental data.^([37])

Consider a generalized linear model function parameterized by $\theta$,

$$h_{\theta}(X) = \frac{1}{1 + e^{-\theta^{T}X}} = \Pr(Y = 1 \mid X; \theta)$$

Therefore,

$$\Pr(Y = 0 \mid X; \theta) = 1 - h_{\theta}(X)$$

and since $Y \in \{0,1\}$, we see that $\Pr(y \mid X; \theta)$ is given by

$$\Pr(y \mid X; \theta) = h_{\theta}(X)^{y}(1 - h_{\theta}(X))^{(1-y)}.$$

We now calculate the likelihood function assuming that all the observations in the sample are independently Bernoulli distributed,

$$\begin{aligned}L(\theta \mid y; x) &= \Pr(Y \mid X; \theta)\\ &= \prod_{i}\Pr(y_{i} \mid x_{i}; \theta)\\ &= \prod_{i}h_{\theta}(x_{i})^{y_{i}}(1 - h_{\theta}(x_{i}))^{(1-y_{i})}\end{aligned}$$

Typically, the log likelihood is maximized,

$$N^{-1}\log L(\theta \mid y; x) = N^{-1}\sum_{i=1}^{N}\log\Pr(y_{i} \mid x_{i}; \theta)$$

which is maximized using optimization techniques such as gradient descent.

Assuming the $(x,y)$ pairs are drawn uniformly from the underlying distribution, then in the limit of large N,

$$\begin{aligned}&\lim_{N\to+\infty}N^{-1}\sum_{i=1}^{N}\log\Pr(y_{i}\mid x_{i};\theta) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X=x, Y=y)\log\Pr(Y=y\mid X=x;\theta)\\ ={}&\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X=x, Y=y)\left(-\log\frac{\Pr(Y=y\mid X=x)}{\Pr(Y=y\mid X=x;\theta)} + \log\Pr(Y=y\mid X=x)\right)\\ ={}&-D_{\text{KL}}(Y\parallel Y_{\theta}) - H(Y\mid X)\end{aligned}$$

where $H(Y\mid X)$ is the conditional entropy and $D_{\text{KL}}$ is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution; intuitively, one is searching for the model that makes the fewest assumptions in its parameters.
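As a concrete illustration of the maximization just described, the sketch below (again Python with NumPy; the synthetic data, learning rate, and iteration count are arbitrary choices for the example) fits $\theta$ by gradient ascent on the average log-likelihood:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 500
    X = np.c_[np.ones(N), rng.normal(size=(N, 2))]  # design matrix with intercept column
    theta_true = np.array([-0.5, 2.0, -1.0])        # hypothetical true parameters
    y = rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true))  # Bernoulli draws

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    theta = np.zeros(3)
    lr = 0.1
    for _ in range(2000):
        p = sigmoid(X @ theta)        # h_theta(x_i) for every observation
        grad = X.T @ (y - p) / N      # gradient of the average log-likelihood
        theta += lr * grad            # gradient ascent step

    print(theta)  # should land near theta_true for this synthetic data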
Comparison with linear regression

Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution $y \mid x$ is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.

Alternatives

A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (inverse error function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.^([38]) Other sigmoid functions or error distributions can be used instead.

Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis.^([39]) If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.^([40])

The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.^([29])

History

A detailed history of the logistic regression is given in Cramer (2002). The logistic function was developed as a model of population growth and named "logistic" by Pierre François Verhulst in the 1830s and 1840s, under the guidance of Adolphe Quetelet; see Logistic function § History for details.^([41]) In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data.^([42][43]) In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.^([44][45])

The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883).^([46]) An autocatalytic reaction is one in which one of the products is itself a catalyst for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained.

The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, published as Pearl & Reed (1920), which led to its use in modern statistics.
They were initially unaware of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology.^([47]) Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since.^([48]) Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.^([49])

In the 1930s, the probit model was developed and systematized by Chester Ittner Bliss, who coined the term "probit" in Bliss (1934), and by John Gaddum in Gaddum (1933), and the model fit by maximum likelihood estimation by Ronald A. Fisher in Fisher (1935), as an addendum to Bliss's work. The probit model was principally used in bioassay, and had been preceded by earlier work dating to 1860; see Probit model § History. The probit model influenced the subsequent development of the logit model and these models competed with each other.^([50])

The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester in Wilson & Worcester (1943).^([51]) However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning in Berkson (1944), where he coined "logit", by analogy with "probit", and continuing through Berkson (1951) and following years.^([52]) The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit",^([53]) particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.^([3])

Various refinements occurred during that time, notably by David Cox, as in Cox (1958).^([4])

The multinomial logit model was introduced independently in Cox (1966) and Theil (1969), which greatly increased the scope of application and the popularity of the logit model.^([54]) In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences;^([55]) this gave a theoretical foundation for the logistic regression.^([54])

Extensions

There are large numbers of extensions:

- Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical dependent variable (with unordered values, also called "classification").
  Note that the general case of having dependent variables with more than two values is termed polytomous regression.
- Ordered logistic regression (or ordered logit) handles ordinal dependent variables (ordered values).
- Mixed logit is an extension of multinomial logit that allows for correlations among the choices of the dependent variable.
- An extension of the logistic model to sets of interdependent variables is the conditional random field.
- Conditional logistic regression handles matched or stratified data when the strata are small. It is mostly used in the analysis of observational studies.

Software

Most statistical software can do binary logistic regression.

- SPSS for basic logistic regression.
- Stata
- SAS
  - PROC LOGISTIC for basic logistic regression.
  - PROC CATMOD when all the variables are categorical.
  - PROC GLIMMIX for multilevel model logistic regression.
- R
  - glm in the stats package (using family = binomial)^([56])
  - lrm in the rms package
  - GLMNET package for an efficient implementation of regularized logistic regression
  - lmer for mixed effects logistic regression
  - Rfast package command gm_logistic for fast and heavy calculations involving large scale data.
  - arm package for Bayesian logistic regression
- Python
  - Logit in the Statsmodels module.
  - LogisticRegression in the scikit-learn module.
  - LogisticRegressor in the TensorFlow module.
  - Full example of logistic regression in the Theano tutorial
  - Bayesian Logistic Regression with ARD prior code, tutorial
  - Variational Bayes Logistic Regression with ARD prior code, tutorial
  - Bayesian Logistic Regression code, tutorial
- NCSS
  - Logistic Regression in NCSS
- Matlab
  - mnrfit in the Statistics and Machine Learning Toolbox (with "incorrect" coded as 2 instead of 0)
  - fminunc/fmincon, fitglm, mnrfit, fitclinear, mle can all do logistic regression.
- Java (JVM)
  - LibLinear
  - Apache Flink
  - Apache Spark
  - SparkML supports Logistic Regression
- FPGA
  - Logistic Regression IP core in HLS for FPGA.

Notably, Microsoft Excel's statistics extension package does not include it.

See also

- Mathematics portal
- Logistic function
- Discrete choice
- Jarrow–Turnbull model
- Limited dependent variable
- Multinomial logit model
- Ordered logit
- Hosmer–Lemeshow test
- Brier score
- mlpack - contains a C++ implementation of logistic regression
- Local case-control sampling
- Logistic model tree

References

1. Tolles, Juliana; Meurer, William J (2016). "Logistic Regression Relating Patient Characteristics to Outcomes". JAMA. 316 (5): 533–4. doi:10.1001/jama.2016.7653. ISSN 0098-7484. OCLC 6823603312. PMID 27483067.
2. Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley. ISBN 978-0-471-35632-5.[page needed]
3. Cramer 2002, pp. 10–11.
4. Walker, SH; Duncan, DB (1967). "Estimation of the probability of an event as a function of several independent variables". Biometrika. 54 (1/2): 167–178. doi:10.2307/2333860. JSTOR 2333860.
5. Cramer 2002, p. 8.
6. Boyd, C. R.; Tolson, M. A.; Copes, W. S. (1987). "Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score". The Journal of Trauma. 27 (4): 370–378. doi:10.1097/00005373-198704000-00005. PMID 3106646.
7. Kologlu, M.; Elker, D.; Altun, H.; Sayek, I. (2001). "Validation of MPI and PIA II in two different groups of patients with secondary peritonitis". Hepato-Gastroenterology. 48 (37): 147–51. PMID 11268952.
8. Biondo, S.; Ramos, E.; Deiros, M.; Ragué, J. M.; De Oca, J.; Moreno, P.; Farran, L.; Jaurrieta, E. (2000). "Prognostic factors for mortality in left colonic peritonitis: A new scoring system". Journal of the American College of Surgeons. 191 (6): 635–42. doi:10.1016/S1072-7515(00)00758-4. PMID 11129812.
9. Marshall, J. C.; Cook, D. J.; Christou, N. V.; Bernard, G. R.; Sprung, C. L.; Sibbald, W. J. (1995). "Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome". Critical Care Medicine. 23 (10): 1638–52. doi:10.1097/00003246-199510000-00007. PMID 7587228.
10. Le Gall, J. R.; Lemeshow, S.; Saulnier, F. (1993). "A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study". JAMA. 270 (24): 2957–63. doi:10.1001/jama.1993.03510240069035. PMID 8254858.
11. David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 128.
12. Truett, J; Cornfield, J; Kannel, W (1967). "A multivariate analysis of the risk of coronary heart disease in Framingham". Journal of Chronic Diseases. 20 (7): 511–24. doi:10.1016/0021-9681(67)90082-3. PMID 6028270.
13. Harrell, Frank E. (2001). Regression Modeling Strategies (2nd ed.). Springer-Verlag. ISBN 978-0-387-95232-1.
14. M. Strano; B.M. Colosimo (2006). "Logistic regression analysis for experimental determination of forming limit diagrams". International Journal of Machine Tools and Manufacture. 46 (6): 673–682. doi:10.1016/j.ijmachtools.2005.07.005.
15. Palei, S. K.; Das, S. K. (2009). "Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach". Safety Science. 47: 88–96. doi:10.1016/j.ssci.2008.01.002.
16. Berry, Michael J.A (1997). Data Mining Techniques For Marketing, Sales and Customer Support. Wiley. p. 10.
17. "How to Interpret Odds Ratio in Logistic Regression?". Institute for Digital Research and Education.
18. Everitt, Brian (1998). The Cambridge Dictionary of Statistics. Cambridge, UK; New York: Cambridge University Press. ISBN 978-0521593465.
19. For example, the indicator function in this case could be defined as $\Delta(n,y) = 1 - (y-n)^{2}$
20. Malouf, Robert (2002). "A comparison of algorithms for maximum entropy parameter estimation". Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002). pp. 49–55. doi:10.3115/1118853.1118871.
21. Menard, Scott W. (2002). Applied Logistic Regression (2nd ed.). SAGE. ISBN 978-0-7619-2208-7.[page needed]
22. Gourieroux, Christian; Monfort, Alain (1981). "Asymptotic Properties of the Maximum Likelihood Estimator in Dichotomous Logit Models". Journal of Econometrics. 17 (1): 83–97. doi:10.1016/0304-4076(81)90060-9.
23. Park, Byeong U.; Simar, Léopold; Zelenyuk, Valentin (2017). "Nonparametric estimation of dynamic discrete choice models for time series data" (PDF). Computational Statistics & Data Analysis. 108: 97–120. doi:10.1016/j.csda.2016.10.024.
24. See e.g. Murphy, Kevin P. (2012). Machine Learning – A Probabilistic Perspective. The MIT Press. p. 245. ISBN 978-0-262-01802-9.
25. Van Smeden, M.; De Groot, J. A.; Moons, K. G.; Collins, G. S.; Altman, D. G.; Eijkemans, M. J.; Reitsma, J. B. (2016). "No rationale for 1 variable per 10 events criterion for binary logistic regression analysis". BMC Medical Research Methodology. 16 (1): 163. doi:10.1186/s12874-016-0267-3. PMC 5122171. PMID 27881078.
26. Peduzzi, P; Concato, J; Kemper, E; Holford, TR; Feinstein, AR (December 1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology. 49 (12): 1373–9. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
27. Vittinghoff, E.; McCulloch, C. E. (12 January 2007). "Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression". American Journal of Epidemiology. 165 (6): 710–718. doi:10.1093/aje/kwk052. PMID 17182981.
28. van der Ploeg, Tjeerd; Austin, Peter C.; Steyerberg, Ewout W. (2014). "Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints". BMC Medical Research Methodology. 14: 137. doi:10.1186/1471-2288-14-137. PMC 4289553. PMID 25532820.
29. Harrell, Frank E. (2015). Regression Modeling Strategies. Springer Series in Statistics (2nd ed.). New York: Springer. doi:10.1007/978-3-319-19425-7. ISBN 978-3-319-19424-0.
30. Greene, William N. (2003). Econometric Analysis (5th ed.). Prentice-Hall. ISBN 978-0-13-066189-0.
31. Cohen, Jacob; Cohen, Patricia; West, Steven G.; Aiken, Leona S. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Routledge. ISBN 978-0-8058-2223-6.[page needed]
32. Allison, Paul D. "Measures of fit for logistic regression" (PDF). Statistical Horizons LLC and the University of Pennsylvania.
33. Hosmer, D.W. (1997). "A comparison of goodness-of-fit tests for the logistic regression model". Stat Med. 16 (9): 965–980. doi:10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.3.co;2-f. PMID 9160492.
34. Harrell, Frank E. (2010). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer. ISBN 978-1-4419-2918-1.[page needed]
35. https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16
36. Mount, J. (2011). "The Equivalence of Logistic Regression and Maximum Entropy models" (PDF). Retrieved Feb 23, 2022.
37. Ng, Andrew (2000). "CS229 Lecture Notes" (PDF). CS229 Lecture Notes: 16–19.
38. Rodríguez, G. (2007). Lecture Notes on Generalized Linear Models. Chapter 3, page 45.
39. Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. 6.
40. Pohar, Maja; Blas, Mateja; Turk, Sandra (2004). "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study". Metodološki Zvezki. 1 (1).
41. Cramer 2002, pp. 3–5.
42. Verhulst, Pierre-François (1838). "Notice sur la loi que la population poursuit dans son accroissement" (PDF). Correspondance Mathématique et Physique. 10: 113–121. Retrieved 3 December 2014.
43. Cramer 2002, p. 4, "He did not say how he fitted the curves."
44. Verhulst, Pierre-François (1845). "Recherches mathématiques sur la loi d'accroissement de la population" [Mathematical Researches into the Law of Population Growth Increase]. Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles. 18. Retrieved 2013-02-18.
45. Cramer 2002, p. 4.
46. Cramer 2002, p. 7.
47. Cramer 2002, p. 6.
48. Cramer 2002, pp. 6–7.
49. Cramer 2002, p. 5.
50. Cramer 2002, pp. 7–9.
51. Cramer 2002, p. 9.
8, "As far as I can see the introduction of the logistics as an alternative to the normal probability function is the work of a single person, Joseph Berkson (1899–1982), ..."53. ^ Cramer 2002, p. 11.54. ^ ^(a) ^(b) Cramer, p. 13. sfn error: no target: CITEREFCramer (help)55. ^ McFadden, Daniel (1973). "Conditional Logit Analysis of Qualitative Choice Behavior" (PDF). In P. Zarembka (ed.). Frontiers in Econometrics. New York: Academic Press. pp. 105–142. Archived from the original (PDF) on 2018-11-27. Retrieved 2019-04-20.56. ^ Gelman, Andrew; Hill, Jennifer (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press. pp. 79–108. ISBN 978-0-521-68689-1.Further reading[edit]- Cox, David R. (1958). "The regression analysis of binary sequences (with discussion)". J R Stat Soc B. 20 (2): 215–242. JSTOR 2983890.- Cox, David R. (1966). "Some procedures connected with the logistic qualitative response curve". In F. N. David (1966) (ed.). Research Papers in Probability and Statistics (Festschrift for J. Neyman). London: Wiley. pp. 55–71.- Cramer, J. S. (2002). The origins of logistic regression (PDF) (Technical report). Vol. 119. Tinbergen Institute. pp. 167–178. doi:10.2139/ssrn.360300. - Published in: Cramer, J. S. (2004). "The early origins of the logit model". Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences. 35 (4): 613–626. doi:10.1016/j.shpsc.2004.09.003.- Theil, Henri (1969). "A Multinomial Extension of the Linear Logit Model". International Economic Review. 10 (3): 251–59. doi:10.2307/2525642. JSTOR 2525642.- Wilson, E.B.; Worcester, J. (1943). "The Determination of L.D.50 and Its Sampling Error in Bio-Assay". Proceedings of the National Academy of Sciences of the United States of America. 29 (2): 79–85. Bibcode:1943PNAS...29...79W. doi:10.1073/pnas.29.2.79. PMC 1078563. PMID 16588606.- Agresti, Alan. (2002). Categorical Data Analysis. New York: Wiley-Interscience. ISBN 978-0-471-36093-3.- Amemiya, Takeshi (1985). "Qualitative Response Models". Advanced Econometrics. Oxford: Basil Blackwell. pp. 267–359. ISBN 978-0-631-13345-2.- Balakrishnan, N. (1991). Handbook of the Logistic Distribution. Marcel Dekker, Inc. ISBN 978-0-8247-8587-1.- Gouriéroux, Christian (2000). "The Simple Dichotomy". Econometrics of Qualitative Dependent Variables. New York: Cambridge University Press. pp. 6–37. ISBN 978-0-521-58985-7.- Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 978-0-13-066189-0.- Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.- Hosmer, David (2013). Applied logistic regression. Hoboken, New Jersey: Wiley. ISBN 978-0470582473.- Howell, David C. (2010). Statistical Methods for Psychology, 7th ed. Belmont, CA; Thomson Wadsworth. ISBN 978-0-495-59786-5.- Peduzzi, P.; J. Concato; E. Kemper; T.R. Holford; A.R. Feinstein (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology. 49 (12): 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.- Berry, Michael J.A.; Linoff, Gordon (1997). Data Mining Techniques For Marketing, Sales and Customer Support. 
External links

- Wikiversity has learning resources about Logistic regression
- Media related to Logistic regression at Wikimedia Commons
- Econometrics Lecture (topic: Logit model) on YouTube by Mark Thoma
- Logistic Regression tutorial
- mlelr: software in C for teaching purposes

Set of methods for supervised statistical learning
In machine learning, support vector machines (SVMs, also support vector networks^([1])) are supervised learning models
with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995,^([1]) Vapnik et al., 1997^([citation needed])), SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

The support vector clustering^([2]) algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data.^([citation needed]) These data sets require unsupervised learning approaches, which attempt to find natural clustering of the data into groups and, then, to map new data according to these clusters.

Motivation

[]
H₁ does not separate the classes. H₂ does, but only with a small margin. H₃ separates them with the maximal margin.

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a $p$-dimensional vector (a list of $p$ numbers), and we want to know whether we can separate such points with a $(p-1)$-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier; or equivalently, the perceptron of optimal stability.^([citation needed])

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks like outlier detection.^([3]) Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.^([4])
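To make the margin computation concrete, the following minimal sketch (Python with NumPy; the weight vector, offset, and points are made-up values for illustration) evaluates the signed distance of labeled points from a hyperplane $\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = 0$ and the resulting geometric margin:

    import numpy as np

    w = np.array([2.0, 1.0])         # hypothetical normal vector of the hyperplane
    b = 1.0                          # hypothetical offset
    points = np.array([[1.0, 2.0],   # class +1
                       [2.0, 1.0],   # class +1
                       [-1.0, 0.0],  # class -1
                       [0.0, -2.0]]) # class -1
    labels = np.array([1, 1, -1, -1])

    # Signed distance of each point from the hyperplane w.x - b = 0.
    dist = (points @ w - b) / np.linalg.norm(w)

    # The geometric margin is the smallest distance of any correctly
    # classified point to the hyperplane.
    margin = np.min(labels * dist)
    print(dist, margin)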
[]
Kernel machine

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed^([5]) that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products of pairs of input data vectors may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $k(x,y)$ selected to suit the problem.^([6]) The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant, where such a set of vectors is an orthogonal (and thus minimal) set of vectors that defines a hyperplane. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters $\alpha_{i}$ of images of feature vectors $x_{i}$ that occur in the data base. With this choice of a hyperplane, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation

$$\sum_{i}\alpha_{i}k(x_{i},x) = \text{constant}.$$

Note that if $k(x,y)$ becomes small as $y$ grows further away from $x$, each term in the sum measures the degree of closeness of the test point $x$ to the corresponding data base point $x_{i}$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $x$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.

Applications

SVMs can be used to solve various real-world problems:

- SVMs are helpful in text and hypertext categorization, as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.^([7]) Some methods for shallow semantic parsing are based on support vector machines.^([8])
- Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback. This is also true for image segmentation systems, including those using a modified version of SVM that uses the privileged approach as suggested by Vapnik.^([9][10])
- Classification of satellite data like SAR data using supervised SVM.^([11])
- Hand-written characters can be recognized using SVM.^([12][13])
- The SVM algorithm has been widely applied in the biological and other sciences. They have been used to classify proteins with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights have been suggested as a mechanism for interpretation of SVM models.^([14][15]) Support vector machine weights have also been used to interpret SVM models in the past.^([16]) Post hoc interpretation of support vector machine models in order to identify features used by the model to make predictions is a relatively new area of research with special significance in the biological sciences.

History

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya.
Chervonenkis in 1964.^([citation needed]) In 1992, Bernhard Boser, Isabelle Guyon and Vladimir Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.^([5]) The "soft margin" incarnation, as is commonly used in software packages, was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.^([1])

Linear SVM

[]
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

We are given a training dataset of $n$ points of the form

$$(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{n},y_{n}),$$

where the $y_{i}$ are either 1 or −1, each indicating the class to which the point $\mathbf{x}_{i}$ belongs. Each $\mathbf{x}_{i}$ is a $p$-dimensional real vector. We want to find the "maximum-margin hyperplane" that divides the group of points $\mathbf{x}_{i}$ for which $y_{i} = 1$ from the group of points for which $y_{i} = -1$, which is defined so that the distance between the hyperplane and the nearest point $\mathbf{x}_{i}$ from either group is maximized.

Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying

$$\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = 0,$$

where $\mathbf{w}$ is the (not necessarily normalized) normal vector to the hyperplane. This is much like Hesse normal form, except that $\mathbf{w}$ is not necessarily a unit vector. The parameter $\tfrac{b}{\|\mathbf{w}\|}$ determines the offset of the hyperplane from the origin along the normal vector $\mathbf{w}$.

Hard-margin

If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalized or standardized dataset, these hyperplanes can be described by the equations

$$\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = 1$$ (anything on or above this boundary is of one class, with label 1)

and

$$\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = -1$$ (anything on or below this boundary is of the other class, with label −1).

Geometrically, the distance between these two hyperplanes is $\tfrac{2}{\|\mathbf{w}\|}$,^([17]) so to maximize the distance between the planes we want to minimize $\|\mathbf{w}\|$. The distance is computed using the distance from a point to a plane equation. To prevent data points from falling into the margin, we add the following constraint: for each $i$, either

$$\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b \geq 1, \text{ if } y_{i} = 1,$$

or

$$\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b \leq -1, \text{ if } y_{i} = -1.$$

These constraints state that each data point must lie on the correct side of the margin. This can be rewritten as

$$y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b) \geq 1, \quad \text{for all } 1 \leq i \leq n. \qquad (1)$$
| | || | --- --- --- | || [{\displaystyle | | || y_{i}(\mathbf {w} | | || ^{\mathsf {T}}\mathbf | | || {x} _{i}-b)\geq | | || 1,\quad {\text{ for | | || all }}1\leq i\leq | | || n.}] | | |+-----------------------+-----------------------+-----------------------+We can put this together to get the optimization problem:$\begin{matrix} & \underset{\mathbf{w},\; b}{minimize} & & {\|\mathbf{w}\|_{2}^{2}} \\ & \text{subject\ to} & & {y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i} - b) \geq 1\quad\forall i \in \{ 1,\ldots,n\}} \\\end{matrix}$[{\displaystyle {\begin{aligned}&{\underset {\mathbf {w},\;b}{\operatorname {minimize} }}&&\|\mathbf {w}\|_{2}^{2}\\&{\text{subject to}}&&y_{i}(\mathbf {w} ^{\top }\mathbf {x}_{i}-b)\geq 1\quad \forall i\in \{1,\dots ,n\}\end{aligned}}}]The w[\mathbf {w} ] and b[b] that solve this problem determine ourclassifier, x ↦ sgn(w^(T)x−b)[{\displaystyle \mathbf {x} \mapsto\operatorname {sgn}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} -b)}] wheresgn(⋅)[{\displaystyle \operatorname {sgn}(\cdot )}] is the signfunction.An important consequence of this geometric description is that themax-margin hyperplane is completely determined by those x_(i)[\mathbf{x} _{i}] that lie nearest to it. These x_(i)[\mathbf {x} _{i}] arecalled support vectors.Soft-margin[edit]To extend SVM to cases in which the data are not linearly separable, thehinge loss function is helpfulmax (0,1−y_(i)(w^(T)x_(i)−b)).[{\displaystyle \max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf{x} _{i}-b)\right).}]Note that y_(i)[y_{i}] is the i-th target (i.e., in this case, 1 or −1),and w^(T)x_(i) − b[{\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x}_{i}-b}] is the i-th output.This function is zero if the constraint in (1) is satisfied, in otherwords, if x_(i)[\mathbf {x} _{i}] lies on the correct side of themargin. For data on the wrong side of the margin, the function's valueis proportional to the distance from the margin.The goal of the optimization then is to minimize$$\lambda\|\mathbf{w}\|^{2} + \left\lbrack {\frac{1}{n}\sum\limits_{i = 1}^{n}\max\left( {0,1 - y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b)} \right)} \right\rbrack,$$[{\displaystyle \lambda \lVert \mathbf {w} \rVert ^{2}+\left[{\frac{1}{n}}\sum _{i=1}^{n}\max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf{T}}\mathbf {x} _{i}-b)\right)\right],}]where the parameter λ > 0[\lambda >0] determines the trade-off betweenincreasing the margin size and ensuring that the x_(i)[\mathbf {x} _{i}]lie on the correct side of the margin. By deconstructing the hinge loss,this optimization problem can be massaged into the following:$\begin{matrix} & \underset{\mathbf{w},\; b,\;\zeta}{minimize} & & {\|\mathbf{w}\|_{2}^{2} + C\sum\limits_{i = 1}^{n}\zeta_{i}} \\ & \text{subject\ to} & & {y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i} - b) \geq 1 - \zeta_{i},\quad\zeta_{i} \geq 0\quad\forall i \in \{ 1,\ldots,n\}} \\\end{matrix}$[{\displaystyle {\begin{aligned}&{\underset {\mathbf {w},\;b,\;\mathbf {\zeta } }{\operatorname {minimize} }}&&\|\mathbf {w}\|_{2}^{2}+C\sum _{i=1}^{n}\zeta _{i}\\&{\text{subjectto}}&&y_{i}(\mathbf {w} ^{\top }\mathbf {x} _{i}-b)\geq 1-\zeta_{i},\quad \zeta _{i}\geq 0\quad \forall i\in \{1,\dots,n\}\end{aligned}}}]Thus, for large values of C[C], it will behave similar to thehard-margin SVM, if the input data are linearly classifiable, but willstill learn if a classification rule is viable or not. (λ[\lambda ] isinversely related to C[C], e.g. 
Nonlinear Kernels

[]
Kernel machine

The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vladimir Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.^([18])) to maximum-margin hyperplanes.^([5]) The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be nonlinear in the original input space.

It is noteworthy that working in a higher-dimensional feature space increases the generalization error of support vector machines, although given enough samples the algorithm still performs well.^([19])

Some common kernels include:

- Polynomial (homogeneous): $k(\mathbf{x}_{i},\mathbf{x}_{j}) = (\mathbf{x}_{i}\cdot\mathbf{x}_{j})^{d}$. Particularly, when $d = 1$, this becomes the linear kernel.
- Polynomial (inhomogeneous): $k(\mathbf{x}_{i},\mathbf{x}_{j}) = (\mathbf{x}_{i}\cdot\mathbf{x}_{j} + r)^{d}$.
- Gaussian radial basis function: $k(\mathbf{x}_{i},\mathbf{x}_{j}) = \exp\left(-\gamma\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|^{2}\right)$ for $\gamma > 0$. Sometimes parametrized using $\gamma = 1/(2\sigma^{2})$.
- Sigmoid function (hyperbolic tangent): $k(\mathbf{x}_{i},\mathbf{x}_{j}) = \tanh(\kappa\,\mathbf{x}_{i}\cdot\mathbf{x}_{j} + c)$ for some (not every) $\kappa > 0$ and $c < 0$.

The kernel is related to the transform $\varphi(\mathbf{x}_{i})$ by the equation $k(\mathbf{x}_{i},\mathbf{x}_{j}) = \varphi(\mathbf{x}_{i})\cdot\varphi(\mathbf{x}_{j})$. The value $\mathbf{w}$ is also in the transformed space, with $\mathbf{w} = \sum_{i}\alpha_{i}y_{i}\varphi(\mathbf{x}_{i})$. Dot products with $\mathbf{w}$ for classification can again be computed by the kernel trick, i.e. $\mathbf{w}\cdot\varphi(\mathbf{x}) = \sum_{i}\alpha_{i}y_{i}k(\mathbf{x}_{i},\mathbf{x})$.
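The sketch below (Python with NumPy; the coefficients $\alpha_{i}$ and offset are placeholders rather than the result of an actual training run) implements the Gaussian RBF kernel from the list above and the kernelized decision sum $\sum_{i}\alpha_{i}y_{i}k(\mathbf{x}_{i},\mathbf{x})$:

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # k(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    rng = np.random.default_rng(3)
    X = rng.normal(size=(5, 2))       # training points x_i
    y = np.array([1, -1, 1, -1, 1])   # labels y_i
    alpha = rng.random(5)             # placeholder coefficients (not trained)
    b = 0.1                           # placeholder offset

    def decision(z):
        # sgn( sum_i alpha_i y_i k(x_i, z) - b )
        s = sum(alpha[i] * y[i] * rbf_kernel(X[i], z) for i in range(len(X)))
        return np.sign(s - b)

    print(decision(np.array([0.0, 0.0])))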
Computing the SVM classifier

Computing the (soft-margin) SVM classifier amounts to minimizing an expression of the form

$$\left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0, 1 - y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b)\right)\right] + \lambda\|\mathbf{w}\|^{2}. \qquad (2)$$

We focus on the soft-margin classifier since, as noted above, choosing a sufficiently small value for $\lambda$ yields the hard-margin classifier for linearly classifiable input data. The classical approach, which involves reducing (2) to a quadratic programming problem, is detailed below. Then, more recent approaches such as sub-gradient descent and coordinate descent will be discussed.

Primal

Minimizing (2) can be rewritten as a constrained optimization problem with a differentiable objective function in the following way.

For each $i \in \{1,\ldots,n\}$ we introduce a variable $\zeta_{i} = \max\left(0, 1 - y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b)\right)$. Note that $\zeta_{i}$ is the smallest nonnegative number satisfying $y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b) \geq 1 - \zeta_{i}.$

Thus we can rewrite the optimization problem as follows

$$\begin{aligned}&\text{minimize } \frac{1}{n}\sum_{i=1}^{n}\zeta_{i} + \lambda\|\mathbf{w}\|^{2}\\&\text{subject to } y_{i}\left(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b\right) \geq 1 - \zeta_{i} \text{ and } \zeta_{i} \geq 0, \text{ for all } i.\end{aligned}$$

This is called the primal problem.

Dual

By solving for the Lagrangian dual of the above problem, one obtains the simplified problem

$$\begin{aligned}&\text{maximize } f(c_{1}\ldots c_{n}) = \sum_{i=1}^{n}c_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}c_{i}(\mathbf{x}_{i}^{\mathsf{T}}\mathbf{x}_{j})y_{j}c_{j},\\&\text{subject to } \sum_{i=1}^{n}c_{i}y_{i} = 0, \text{ and } 0 \leq c_{i} \leq \frac{1}{2n\lambda} \text{ for all } i.\end{aligned}$$

This is called the dual problem.
Since the dual maximization problem is a quadratic function of the $c_{i}$ subject to linear constraints, it is efficiently solvable by quadratic programming algorithms.

Here, the variables $c_{i}$ are defined such that

$$\mathbf{w} = \sum_{i=1}^{n}c_{i}y_{i}\mathbf{x}_{i}.$$

Moreover, $c_{i} = 0$ exactly when $\mathbf{x}_{i}$ lies on the correct side of the margin, and $0 < c_{i} < (2n\lambda)^{-1}$ when $\mathbf{x}_{i}$ lies on the margin's boundary. It follows that $\mathbf{w}$ can be written as a linear combination of the support vectors.

The offset, $b$, can be recovered by finding an $\mathbf{x}_{i}$ on the margin's boundary and solving

$$y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b) = 1 \iff b = \mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - y_{i}.$$

(Note that $y_{i}^{-1} = y_{i}$ since $y_{i} = \pm 1$.)

Kernel trick

Main article: Kernel method

[]
A training example of SVM with kernel given by φ((a, b)) = (a, b, a² + b²)

Suppose now that we would like to learn a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points $\varphi(\mathbf{x}_{i})$. Moreover, we are given a kernel function $k$ which satisfies $k(\mathbf{x}_{i},\mathbf{x}_{j}) = \varphi(\mathbf{x}_{i})\cdot\varphi(\mathbf{x}_{j})$.

We know the classification vector $\mathbf{w}$ in the transformed space satisfies

$$\mathbf{w} = \sum_{i=1}^{n}c_{i}y_{i}\varphi(\mathbf{x}_{i}),$$

where the $c_{i}$ are obtained by solving the optimization problem

$$\begin{aligned}\text{maximize } f(c_{1}\ldots c_{n}) &= \sum_{i=1}^{n}c_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}c_{i}(\varphi(\mathbf{x}_{i})\cdot\varphi(\mathbf{x}_{j}))y_{j}c_{j}\\&= \sum_{i=1}^{n}c_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}c_{i}k(\mathbf{x}_{i},\mathbf{x}_{j})y_{j}c_{j}\\\text{subject to } \sum_{i=1}^{n}c_{i}y_{i} &= 0, \text{ and } 0 \leq c_{i} \leq \frac{1}{2n\lambda} \text{ for all } i.\end{aligned}$$

The coefficients $c_{i}$ can be solved for using quadratic programming, as before.
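A minimal sketch of this quadratic programming step for the kernelized dual might look as follows; it uses scipy.optimize (one arbitrary solver choice, where a dedicated QP solver would be more typical), and the data, RBF kernel, and value of $\lambda$ are assumptions of the illustration:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    X = np.r_[rng.normal(-1.5, 1.0, (10, 2)), rng.normal(1.5, 1.0, (10, 2))]
    y = np.r_[-np.ones(10), np.ones(10)]
    n, lam = len(y), 0.01

    K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF Gram matrix
    Q = (y[:, None] * y[None]) * K                                   # y_i y_j k(x_i, x_j)

    # Maximize f(c) = sum(c) - 0.5 c^T Q c  <=>  minimize its negative.
    def neg_f(c):
        return 0.5 * c @ Q @ c - c.sum()

    res = minimize(
        neg_f,
        x0=np.zeros(n),
        bounds=[(0.0, 1.0 / (2 * n * lam))] * n,              # box constraints on c_i
        constraints=[{"type": "eq", "fun": lambda c: c @ y}],  # sum_i c_i y_i = 0
        method="SLSQP",
    )
    c = res.x
    print(c.round(3))  # nonzero entries correspond to support vectors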
Again, we can find some index $i$ such that $0 < c_{i} < (2n\lambda)^{-1}$, so that $\varphi(\mathbf{x}_{i})$ lies on the boundary of the margin in the transformed space, and then solve

$$\begin{aligned}b = \mathbf{w}^{\mathsf{T}}\varphi(\mathbf{x}_{i}) - y_{i} &= \left[\sum_{j=1}^{n}c_{j}y_{j}\varphi(\mathbf{x}_{j})\cdot\varphi(\mathbf{x}_{i})\right] - y_{i}\\&= \left[\sum_{j=1}^{n}c_{j}y_{j}k(\mathbf{x}_{j},\mathbf{x}_{i})\right] - y_{i}.\end{aligned}$$

Finally,

$$\mathbf{z} \mapsto \operatorname{sgn}(\mathbf{w}^{\mathsf{T}}\varphi(\mathbf{z}) - b) = \operatorname{sgn}\left(\left[\sum_{i=1}^{n}c_{i}y_{i}k(\mathbf{x}_{i},\mathbf{z})\right] - b\right).$$

Modern methods

Recent algorithms for finding the SVM classifier include sub-gradient descent and coordinate descent. Both techniques have proven to offer significant advantages over the traditional approach when dealing with large, sparse datasets: sub-gradient methods are especially efficient when there are many training examples, and coordinate descent when the dimension of the feature space is high.

Sub-gradient descent

Sub-gradient descent algorithms for the SVM work directly with the expression

$$f(\mathbf{w}, b) = \left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0, 1 - y_{i}(\mathbf{w}^{\mathsf{T}}\mathbf{x}_{i} - b)\right)\right] + \lambda\|\mathbf{w}\|^{2}.$$

Note that $f$ is a convex function of $\mathbf{w}$ and $b$. As such, traditional gradient descent (or SGD) methods can be adapted, where instead of taking a step in the direction of the function's gradient, a step is taken in the direction of a vector selected from the function's sub-gradient. This approach has the advantage that, for certain implementations, the number of iterations does not scale with $n$, the number of data points.^([20])

Coordinate descent

Coordinate descent algorithms for the SVM work from the dual problem

$$\begin{aligned}&\text{maximize } f(c_{1}\ldots c_{n}) = \sum_{i=1}^{n}c_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}c_{i}(x_{i}\cdot x_{j})y_{j}c_{j},\\&\text{subject to } \sum_{i=1}^{n}c_{i}y_{i} = 0, \text{ and } 0 \leq c_{i} \leq \frac{1}{2n\lambda} \text{ for all } i.\end{aligned}$$

For each $i \in \{1,\ldots,n\}$, iteratively, the coefficient $c_{i}$ is adjusted in the direction of $\partial f/\partial c_{i}$. Then, the resulting vector of coefficients $(c_{1}',\ldots,c_{n}')$ is projected onto the nearest vector of coefficients that satisfies the given constraints. (Typically Euclidean distances are used.) The process is then repeated until a near-optimal vector of coefficients is obtained. The resulting algorithm is extremely fast in practice, although few performance guarantees have been proven.^([21])
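To illustrate the sub-gradient approach described above, here is a minimal sketch (Python with NumPy) of sub-gradient descent on the primal objective $f(\mathbf{w}, b)$; the diminishing step size and iteration count are arbitrary choices, not prescriptions from the text:

    import numpy as np

    rng = np.random.default_rng(5)
    X = np.r_[rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))]
    y = np.r_[-np.ones(50), np.ones(50)]
    n, lam = len(y), 0.01

    w, b = np.zeros(2), 0.0
    for t in range(1, 2001):
        eta = 1.0 / (lam * t)                  # diminishing step size
        margins = y * (X @ w - b)
        viol = margins < 1                     # points inside or beyond the margin
        # A sub-gradient of f(w, b) = (1/n) sum max(0, 1 - y_i(w.x_i - b)) + lam ||w||^2
        g_w = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        g_b = y[viol].sum() / n
        w -= eta * g_w
        b -= eta * g_b

    print(np.mean(np.sign(X @ w - b) == y))    # training accuracy of the sketch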
Empirical risk minimization

The soft-margin support vector machine described above is an example of an empirical risk minimization (ERM) algorithm for the hinge loss. Seen this way, support vector machines belong to a natural class of algorithms for statistical inference, and many of its unique features are due to the behavior of the hinge loss. This perspective can provide further insight into how and why SVMs work, and allow us to better analyze their statistical properties.

Risk minimization

In supervised learning, one is given a set of training examples $X_{1}\ldots X_{n}$ with labels $y_{1}\ldots y_{n}$, and wishes to predict $y_{n+1}$ given $X_{n+1}$. To do so one forms a hypothesis, $f$, such that $f(X_{n+1})$ is a "good" approximation of $y_{n+1}$. A "good" approximation is usually defined with the help of a loss function, $\ell(y,z)$, which characterizes how bad $z$ is as a prediction of $y$. We would then like to choose a hypothesis that minimizes the expected risk:

$$\varepsilon(f) = \mathbb{E}\left[\ell(y_{n+1}, f(X_{n+1}))\right].$$

In most cases, we don't know the joint distribution of $X_{n+1},\,y_{n+1}$ outright. In these cases, a common strategy is to choose the hypothesis that minimizes the empirical risk:

$$\hat{\varepsilon}(f) = \frac{1}{n}\sum_{k=1}^{n}\ell(y_{k}, f(X_{k})).$$

Under certain assumptions about the sequence of random variables $X_{k},\,y_{k}$ (for example, that they are generated by a finite Markov process), if the set of hypotheses being considered is small enough, the minimizer of the empirical risk will closely approximate the minimizer of the expected risk as $n$ grows large. This approach is called empirical risk minimization, or ERM.

Regularization and stability

In order for the minimization problem to have a well-defined solution, we have to place constraints on the set $\mathcal{H}$ of hypotheses being considered. If $\mathcal{H}$ is a normed space (as is the case for SVM), a particularly effective technique is to consider only those hypotheses $f$ for which $\lVert f\rVert_{\mathcal{H}} < k$.
Empirical risk minimization

The soft-margin support vector machine described above is an example of an empirical risk minimization (ERM) algorithm for the hinge loss. Seen this way, support vector machines belong to a natural class of algorithms for statistical inference, and many of its unique features are due to the behavior of the hinge loss. This perspective can provide further insight into how and why SVMs work, and allow us to better analyze their statistical properties.

Risk minimization

In supervised learning, one is given a set of training examples $X_1,\ldots,X_n$ with labels $y_1,\ldots,y_n$, and wishes to predict $y_{n+1}$ given $X_{n+1}$. To do so one forms a hypothesis, $f$, such that $f(X_{n+1})$ is a "good" approximation of $y_{n+1}$. A "good" approximation is usually defined with the help of a loss function, $\ell(y,z)$, which characterizes how bad $z$ is as a prediction of $y$. We would then like to choose a hypothesis that minimizes the expected risk:

$$\varepsilon(f) = \mathbb{E}\left[\ell(y_{n+1}, f(X_{n+1}))\right].$$

In most cases, we don't know the joint distribution of $X_{n+1},\,y_{n+1}$ outright. In these cases, a common strategy is to choose the hypothesis that minimizes the empirical risk:

$$\hat{\varepsilon}(f) = \frac{1}{n}\sum_{k=1}^{n} \ell(y_k, f(X_k)).$$

Under certain assumptions about the sequence of random variables $X_k,\,y_k$ (for example, that they are generated by a finite Markov process), if the set of hypotheses being considered is small enough, the minimizer of the empirical risk will closely approximate the minimizer of the expected risk as $n$ grows large. This approach is called empirical risk minimization, or ERM.

Regularization and stability

In order for the minimization problem to have a well-defined solution, we have to place constraints on the set $\mathcal{H}$ of hypotheses being considered. If $\mathcal{H}$ is a normed space (as is the case for SVM), a particularly effective technique is to consider only those hypotheses $f$ for which $\lVert f\rVert_{\mathcal{H}} < k$. This is equivalent to imposing a regularization penalty $\mathcal{R}(f) = \lambda_k \lVert f\rVert_{\mathcal{H}}$, and solving the new optimization problem

$$\hat{f} = \arg\min_{f\in\mathcal{H}} \hat{\varepsilon}(f) + \mathcal{R}(f).$$

This approach is called Tikhonov regularization.

More generally, $\mathcal{R}(f)$ can be some measure of the complexity of the hypothesis $f$, so that simpler hypotheses are preferred.

SVM and the hinge loss

Recall that the (soft-margin) SVM classifier $\hat{\mathbf{w}}, b: \mathbf{x}\mapsto\operatorname{sgn}(\hat{\mathbf{w}}^{\mathsf{T}}\mathbf{x} - b)$ is chosen to minimize the following expression:

$$\left[\frac{1}{n}\sum_{i=1}^{n} \max\left(0,\, 1 - y_i(\mathbf{w}^{\mathsf{T}}\mathbf{x}_i - b)\right)\right] + \lambda\|\mathbf{w}\|^2.$$

In light of the above discussion, we see that the SVM technique is equivalent to empirical risk minimization with Tikhonov regularization, where in this case the loss function is the hinge loss

$$\ell(y,z) = \max(0,\, 1 - yz).$$

From this perspective, SVM is closely related to other fundamental classification algorithms such as regularized least-squares and logistic regression. The difference between the three lies in the choice of loss function: regularized least-squares amounts to empirical risk minimization with the square-loss, $\ell_{sq}(y,z) = (y-z)^2$; logistic regression employs the log-loss, $\ell_{\log}(y,z) = \ln(1 + e^{-yz})$.
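For concreteness, the three losses can be evaluated side by side. The sketch below (function names are ours, chosen for illustration) shows that the hinge loss is exactly zero for confidently correct predictions ($yz \geq 1$), while the square and log losses continue to vary there.

```python
import numpy as np

def hinge_loss(y, z):
    """Hinge loss: max(0, 1 - y*z)."""
    return np.maximum(0.0, 1.0 - y * z)

def square_loss(y, z):
    """Square loss: (y - z)^2."""
    return (y - z) ** 2

def log_loss(y, z):
    """Log-loss: ln(1 + exp(-y*z))."""
    return np.log1p(np.exp(-y * z))

# Evaluate each loss for a correct label y = +1 over a range of scores z.
z = np.linspace(-2.0, 3.0, 6)
for name, fn in [("hinge", hinge_loss), ("square", square_loss), ("log", log_loss)]:
    print(name, np.round(fn(1.0, z), 3))
```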
Target functions

The difference between the hinge loss and these other loss functions is best stated in terms of target functions: the function that minimizes expected risk for a given pair of random variables $X,\,y$.

In particular, let $y_x$ denote $y$ conditional on the event that $X = x$. In the classification setting, we have:

$$y_x = \begin{cases} 1 & \text{with probability } p_x \\ -1 & \text{with probability } 1 - p_x \end{cases}$$

The optimal classifier is therefore:

$$f^*(x) = \begin{cases} 1 & \text{if } p_x \geq 1/2 \\ -1 & \text{otherwise} \end{cases}$$

For the square-loss, the target function is the conditional expectation function, $f_{sq}(x) = \mathbb{E}[y_x]$; for the logistic loss, it's the logit function, $f_{\log}(x) = \ln\left(p_x / (1 - p_x)\right)$. While both of these target functions yield the correct classifier, as $\operatorname{sgn}(f_{sq}) = \operatorname{sgn}(f_{\log}) = f^*$, they give us more information than we need. In fact, they give us enough information to completely describe the distribution of $y_x$.

On the other hand, one can check that the target function for the hinge loss is exactly $f^*$. Thus, in a sufficiently rich hypothesis space (or equivalently, for an appropriately chosen kernel) the SVM classifier will converge to the simplest function (in terms of $\mathcal{R}$) that correctly classifies the data. This extends the geometric interpretation of SVM: for linear classification, the empirical risk is minimized by any function whose margins lie between the support vectors, and the simplest of these is the max-margin classifier.^([22])

Properties

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.^([23])

Parameter selection

The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and the soft margin parameter $\lambda$. A common choice is a Gaussian kernel, which has a single parameter $\gamma$. The best combination of $\lambda$ and $\gamma$ is often selected by a grid search with exponentially growing sequences of $\lambda$ and $\gamma$, for example, $\lambda \in \{2^{-5}, 2^{-3}, \dots, 2^{13}, 2^{15}\}$; $\gamma \in \{2^{-15}, 2^{-13}, \dots, 2^{1}, 2^{3}\}$. Typically, each combination of parameter choices is checked using cross validation, and the parameters with the best cross-validation accuracy are picked. Alternatively, recent work in Bayesian optimization can be used to select $\lambda$ and $\gamma$, often requiring the evaluation of far fewer parameter combinations than grid search. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.^([24])
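A sketch of such a grid search using scikit-learn (one of the toolkits listed under Implementation below). Note that scikit-learn's SVC is parametrized by $(C, \gamma)$ rather than $(\lambda, \gamma)$, with $C$ acting as an inverse regularization strength (roughly $C \sim 1/(2n\lambda)$), so a wide exponential grid plays the same role; X_train and y_train are placeholders for your labeled training data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exponentially growing grids, as suggested above.
param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),       # 2^-5, 2^-3, ..., 2^15
    "gamma": 2.0 ** np.arange(-15, 4, 2),   # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)  # each combination scored by 5-fold cross-validation
# print(search.best_params_)    # best (C, gamma); refit on the whole training set
```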
Issues

Potential drawbacks of the SVM include the following aspects:

- Requires full labeling of input data.
- Uncalibrated class membership probabilities: SVM stems from Vapnik's theory, which avoids estimating probabilities on finite data.
- The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section.
- Parameters of a solved model are difficult to interpret.

Extensions

Support vector clustering (SVC)

SVC is a similar method that also builds on kernel functions but is appropriate for unsupervised learning.

Multiclass SVM

Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.

The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.^([25]) Common methods for such reduction include:^([25][26])

- Building binary classifiers that distinguish between one of the labels and the rest (one-versus-all) or between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest-output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification.
- Directed acyclic graph SVM (DAGSVM)^([27])
- Error-correcting output codes^([28])

Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.^([29]) See also Lee, Lin and Wahba^([30][31]) and Van den Burg and Groenen.^([32])
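Both reduction strategies are available off the shelf, for example in scikit-learn; the sketch below wraps a linear SVM in each (X_train, y_train and X_test are assumed placeholders for multiclass data).

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# One-versus-all: one binary SVM per class; prediction is winner-takes-all
# over the per-class decision values.
ova = OneVsRestClassifier(LinearSVC())
# One-versus-one: one binary SVM per pair of classes; prediction is
# max-wins voting over all pairwise classifiers.
ovo = OneVsOneClassifier(LinearSVC())
# ova.fit(X_train, y_train); print(ova.predict(X_test))
# ovo.fit(X_train, y_train); print(ovo.predict(X_test))
```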
Transductive support vector machines

Transductive support vector machines extend SVMs in that they can also treat partially labeled data in semi-supervised learning by following the principles of transduction. Here, in addition to the training set $\mathcal{D}$, the learner is also given a set

$$\mathcal{D}^{\star} = \{\mathbf{x}_i^{\star} \mid \mathbf{x}_i^{\star} \in \mathbb{R}^p\}_{i=1}^{k}$$

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:^([33])

Minimize (in $\mathbf{w}, b, \mathbf{y}^{\star}$)

$$\frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \dots, n$ and any $j = 1, \dots, k$)

$$\begin{aligned} & y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \geq 1, \\ & y_j^{\star}(\mathbf{w}\cdot\mathbf{x}_j^{\star} - b) \geq 1, \end{aligned}$$

and

$$y_j^{\star} \in \{-1, 1\}.$$

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.

Structured SVM

SVMs have been generalized to structured SVMs, where the label space is structured and of possibly infinite size.

Regression

[Figure: support vector regression (prediction) with different thresholds ε. As ε increases, the prediction becomes less sensitive to errors.]

A version of SVM for regression was proposed in 1996 by Vladimir N. Vapnik, Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola.^([34]) This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. Another SVM version known as least-squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle.^([35])

Training the original SVR means solving^([36])

minimize $\frac{1}{2}\|w\|^2$

subject to $|y_i - \langle w, x_i\rangle - b| \leq \varepsilon$

where $x_i$ is a training sample with target value $y_i$. The inner product plus intercept $\langle w, x_i\rangle + b$ is the prediction for that sample, and $\varepsilon$ is a free parameter that serves as a threshold: all predictions have to be within a range $\varepsilon$ of the true targets. Slack variables are usually added into the above to allow for errors and to allow approximation in the case the above problem is infeasible.
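A minimal sketch with scikit-learn's SVR, where the epsilon parameter corresponds to the threshold ε described above (training points inside the ε-tube contribute no loss, so only points on or outside the tube shape the model; data placeholders assumed):

```python
from sklearn.svm import SVR

# epsilon is the half-width of the insensitive tube around the prediction.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
# svr.fit(X_train, y_train)       # X_train: (n, d) inputs, y_train: (n,) targets
# y_pred = svr.predict(X_test)    # predictions for new samples
```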
Bayesian SVM

In 2011 it was shown by Polson and Scott that the SVM admits a Bayesian interpretation through the technique of data augmentation.^([37]) In this approach the SVM is viewed as a graphical model (where the parameters are connected via probability distributions). This extended view allows the application of Bayesian techniques to SVMs, such as flexible feature modeling, automatic hyperparameter tuning, and predictive uncertainty quantification. Recently, a scalable version of the Bayesian SVM was developed by Florian Wenzel, enabling the application of Bayesian SVMs to big data.^([38]) Florian Wenzel developed two different versions: a variational inference (VI) scheme for the Bayesian kernel support vector machine (SVM) and a stochastic version (SVI) for the linear Bayesian SVM.^([39])

Implementation

The parameters of the maximum-margin hyperplane are derived by solving the optimization. There exist several specialized algorithms for quickly solving the quadratic programming (QP) problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks.

Another approach is to use an interior-point method that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems.^([40]) Instead of solving a sequence of broken-down problems, this approach directly solves the problem altogether. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick.

Another common method is Platt's sequential minimal optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that are solved analytically, eliminating the need for a numerical optimization algorithm and matrix storage. This algorithm is conceptually simple, easy to implement, generally faster, and has better scaling properties for difficult SVM problems.^([41])

The special case of linear support vector machines can be solved more efficiently by the same kind of algorithms used to optimize its close cousin, logistic regression; this class of algorithms includes sub-gradient descent (e.g., PEGASOS^([42])) and coordinate descent (e.g., LIBLINEAR^([43])). LIBLINEAR has some attractive training-time properties. Each convergence iteration takes time linear in the time taken to read the training data, and the iterations also have a Q-linear convergence property, making the algorithm extremely fast.

The general kernel SVMs can also be solved more efficiently using sub-gradient descent (e.g. P-packSVM^([44])), especially when parallelization is allowed.

Kernel SVMs are available in many machine-learning toolkits, including LIBSVM, MATLAB, SAS, SVMlight, kernlab, scikit-learn, Shogun, Weka, Shark, JKernelMachines, OpenCV and others.

Preprocessing of data (standardization) is highly recommended to enhance the accuracy of classification.^([45]) There are a few methods of standardization, such as min-max, normalization by decimal scaling, and Z-score.^([46]) Subtraction of the mean and division by the variance of each feature is usually used for SVM.^([47])
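For example, standardization can be chained with the classifier so that the scaling parameters are estimated on the training data only; a minimal scikit-learn sketch (data placeholders assumed):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature, then fit the SVM; the pipeline ensures the
# scaler is fit on the training split only and reapplied at predict time.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# model.fit(X_train, y_train)
# model.score(X_test, y_test)
```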
See also

- In situ adaptive tabulation
- Kernel machines
- Fisher kernel
- Platt scaling
- Polynomial kernel
- Predictive analytics
- Regularization perspectives on support vector machines
- Relevance vector machine, a probabilistic sparse-kernel model identical in functional form to SVM
- Sequential minimal optimization
- Space mapping
- Winnow (algorithm)

References

1. Cortes, Corinna; Vapnik, Vladimir (1995). "Support-vector networks". Machine Learning. 20 (3): 273–297. doi:10.1007/BF00994018.
2. Ben-Hur, Asa; Horn, David; Siegelmann, Hava; Vapnik, Vladimir N. (2001). "Support vector clustering". Journal of Machine Learning Research. 2: 125–137.
3. "1.4. Support Vector Machines — scikit-learn 0.20.2 documentation". Retrieved 2017-11-08.
4. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer. p. 134.
5. Boser, Bernhard E.; Guyon, Isabelle M.; Vapnik, Vladimir N. (1992). "A training algorithm for optimal margin classifiers". Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92). p. 144. doi:10.1145/130385.130401. ISBN 978-0897914970.
6. Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, Brian P. (2007). "Section 16.5. Support Vector Machines". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
7. Joachims, Thorsten (1998). "Text categorization with Support Vector Machines: Learning with many relevant features". Machine Learning: ECML-98. Lecture Notes in Computer Science. Springer. 1398: 137–142. doi:10.1007/BFb0026683. ISBN 978-3-540-64417-0.
8. Pradhan, Sameer S.; et al. (2 May 2004). Shallow Semantic Parsing using Support Vector Machines. Proceedings of HLT-NAACL 2004. Association for Computational Linguistics. pp. 233–240.
9. Vapnik, Vladimir N. (2014). Invited speaker. IPMU Information Processing and Management of Uncertainty.
10. Barghout, Lauren (2015). "Spatial-Taxon Information Granules as Used in Iterative Fuzzy-Decision-Making for Image Segmentation". Granular Computing and Decision-Making. Studies in Big Data. Vol. 10. pp. 285–318. doi:10.1007/978-3-319-16829-6_12. ISBN 978-3-319-16828-9.
11. Maity, A. (2016). "Supervised Classification of RADARSAT-2 Polarimetric Data for Different Land Features". arXiv:1608.00501 [cs.CV].
12. DeCoste, Dennis (2002). "Training Invariant Support Vector Machines". Machine Learning. 46: 161–190. doi:10.1023/A:1012454411458.
13. Maitra, D. S.; Bhattacharya, U.; Parui, S. K. (August 2015). "CNN based common approach to handwritten character recognition of multiple scripts". 2015 13th International Conference on Document Analysis and Recognition (ICDAR): 1021–1025. doi:10.1109/ICDAR.2015.7333916. ISBN 978-1-4799-1805-8.
14. Gaonkar, B.; Davatzikos, C. (2013). "Analytic estimation of statistical significance maps for support vector machine based multi-variate image analysis and classification". NeuroImage. 78: 270–283. doi:10.1016/j.neuroimage.2013.03.066. PMC 3767485. PMID 23583748.
15. Cuingnet, Rémi; Rosso, Charlotte; Chupin, Marie; Lehéricy, Stéphane; Dormont, Didier; Benali, Habib; Samson, Yves; Colliot, Olivier (2011). "Spatial regularization of SVM for the detection of diffusion alterations associated with stroke outcome". Medical Image Analysis. 15 (5): 729–737. doi:10.1016/j.media.2011.05.007. PMID 21752695.
16. Statnikov, Alexander; Hardin, Douglas; Aliferis, Constantin (2006). "Using SVM weight-based methods to identify causally relevant and non-causally relevant variables". Sign. 1: 4.
17. "Why is the SVM margin equal to $\frac{2}{\|\mathbf{w}\|}$". Mathematics Stack Exchange. 30 May 2015.
18. Aizerman, Mark A.; Braverman, Emmanuel M.; Rozonoer, Lev I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control. 25: 821–837.
19. Jin, Chi; Wang, Liwei (2012). Dimensionality dependent PAC-Bayes margin bound. Advances in Neural Information Processing Systems.
20. Shalev-Shwartz, Shai; Singer, Yoram; Srebro, Nathan; Cotter, Andrew (2010). "Pegasos: primal estimated sub-gradient solver for SVM". Mathematical Programming. 127 (1): 3–30. doi:10.1007/s10107-010-0420-4. ISSN 0025-5610.
21. Hsieh, Cho-Jui; Chang, Kai-Wei; Lin, Chih-Jen; Keerthi, S. Sathiya; Sundararajan, S. (2008). A Dual Coordinate Descent Method for Large-scale Linear SVM. Proceedings of the 25th International Conference on Machine Learning (ICML '08). New York, NY, USA: ACM. pp. 408–415. doi:10.1145/1390156.1390208. ISBN 978-1-60558-205-4.
22. Rosasco, Lorenzo; De Vito, Ernesto; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. ISSN 0899-7667. PMID 15070510.
23. Meyer, David; Leisch, Friedrich; Hornik, Kurt (September 2003). "The support vector machine under test". Neurocomputing. 55 (1–2): 169–186. doi:10.1016/S0925-2312(03)00431-4.
24. Hsu, Chih-Wei; Chang, Chih-Chung; Lin, Chih-Jen (2003). A Practical Guide to Support Vector Classification (Technical report). Department of Computer Science and Information Engineering, National Taiwan University.
25. Duan, Kai-Bo; Keerthi, S. Sathiya (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. LNCS. Vol. 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
26. Hsu, Chih-Wei; Lin, Chih-Jen (2002). "A Comparison of Methods for Multiclass Support Vector Machines". IEEE Transactions on Neural Networks. 13 (2): 415–425. doi:10.1109/72.991427. PMID 18244442.
27. Platt, John; Cristianini, Nello; Shawe-Taylor, John (2000). "Large margin DAGs for multiclass classification". In Solla, Sara A.; Leen, Todd K.; Müller, Klaus-Robert (eds.). Advances in Neural Information Processing Systems. MIT Press. pp. 547–553.
28. Dietterich, Thomas G.; Bakiri, Ghulum (1995). "Solving Multiclass Learning Problems via Error-Correcting Output Codes". Journal of Artificial Intelligence Research. 2: 263–286. arXiv:cs/9501101. doi:10.1613/jair.105.
29. Crammer, Koby; Singer, Yoram (2001). "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines". Journal of Machine Learning Research. 2: 265–292.
30. Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2001). "Multicategory Support Vector Machines". Computing Science and Statistics. 33.
31. Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
32. Van den Burg, Gerrit J. J.; Groenen, Patrick J. F. (2016). "GenSVM: A Generalized Multiclass Support Vector Machine". Journal of Machine Learning Research. 17 (224): 1–42.
33. Joachims, Thorsten (1999). Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the 1999 International Conference on Machine Learning (ICML 1999). pp. 200–209.
34. Drucker, Harris; Burges, Christ. C.; Kaufman, Linda; Smola, Alexander J.; Vapnik, Vladimir N. (1997). "Support Vector Regression Machines". Advances in Neural Information Processing Systems 9 (NIPS 1996). MIT Press. pp. 155–161.
35. Suykens, Johan A. K.; Vandewalle, Joos P. L. (June 1999). "Least squares support vector machine classifiers". Neural Processing Letters. 9 (3): 293–300.
36. Smola, Alex J.; Schölkopf, Bernhard (2004). "A tutorial on support vector regression". Statistics and Computing. 14 (3): 199–222. doi:10.1023/B:STCO.0000035301.49549.88.
37. Polson, Nicholas G.; Scott, Steven L. (2011). "Data Augmentation for Support Vector Machines". Bayesian Analysis. 6 (1): 1–23. doi:10.1214/11-BA601.
38. Wenzel, Florian; Galy-Fajou, Theo; Deutsch, Matthäus; Kloft, Marius (2017). "Bayesian Nonlinear Support Vector Machines for Big Data". Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Lecture Notes in Computer Science. 10534: 307–322. arXiv:1707.05532. doi:10.1007/978-3-319-71249-9_19. ISBN 978-3-319-71248-2.
39. Wenzel, Florian; Deutsch, Matthäus; Galy-Fajou, Théo; Kloft, Marius. "Scalable Approximate Inference for the Bayesian Nonlinear Support Vector Machine".
40. Ferris, Michael C.; Munson, Todd S. (2002). "Interior-Point Methods for Massive Support Vector Machines". SIAM Journal on Optimization. 13 (3): 783–804. doi:10.1137/S1052623400374379.
41. Platt, John C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. NIPS.
42. Shalev-Shwartz, Shai; Singer, Yoram; Srebro, Nathan (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML.
43. Fan, Rong-En; Chang, Kai-Wei; Hsieh, Cho-Jui; Wang, Xiang-Rui; Lin, Chih-Jen (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research. 9: 1871–1874.
44. Allen Zhu, Zeyuan; Chen, Weizhu; Wang, Gang; Zhu, Chenguang; Chen, Zheng (2009). P-packSVM: Parallel Primal grAdient desCent Kernel SVM. ICDM.
45. Fan, Rong-En; Chang, Kai-Wei; Hsieh, Cho-Jui; Wang, Xiang-Rui; Lin, Chih-Jen (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research. 9 (Aug): 1871–1874.
46. Mohamad, Ismail; Usman, Dauda (2013). "Standardization and Its Effects on K-Means Clustering Algorithm". Research Journal of Applied Sciences, Engineering and Technology. 6 (17): 3299–3303. doi:10.19026/rjaset.6.3638.
47. Fennell, Peter; Zuo, Zhiya; Lerman, Kristina (2019). "Predicting and explaining behavioral data with structured feature space decomposition". EPJ Data Science. 8. doi:10.1140/epjds/s13688-019-0201-0.

Further reading

- Bennett, Kristin P.; Campbell, Colin (2000). "Support Vector Machines: Hype or Hallelujah?". SIGKDD Explorations. 2 (2): 1–13. doi:10.1145/380995.380999.
- Cristianini, Nello; Shawe-Taylor, John (2000). An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press. ISBN 0-521-78019-5.
- Fradkin, Dmitriy; Muchnik, Ilya (2006). "Support Vector Machines for Classification". In Abello, J.; Carmode, G. (eds.). Discrete Methods in Epidemiology. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. Vol. 70. pp. 13–20.
- Joachims, Thorsten (1998). "Text categorization with Support Vector Machines: Learning with many relevant features". In Nédellec, Claire; Rouveirol, Céline (eds.). Machine Learning: ECML-98. Lecture Notes in Computer Science. Vol. 1398. Berlin, Heidelberg: Springer. pp. 137–142. doi:10.1007/BFb0026683. ISBN 978-3-540-64417-0.
- Ivanciuc, Ovidiu (2007). "Applications of Support Vector Machines in Chemistry". Reviews in Computational Chemistry. 23: 291–400. doi:10.1002/9780470116449.ch6. ISBN 9780470116449.
- James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). "Support Vector Machines". An Introduction to Statistical Learning: with Applications in R. New York: Springer. pp. 337–372. ISBN 978-1-4614-7137-0.
- Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels. Cambridge, MA: MIT Press. ISBN 0-262-19475-9.
- Steinwart, Ingo; Christmann, Andreas (2008). Support Vector Machines. New York: Springer. ISBN 978-0-387-77241-7.
- Theodoridis, Sergios; Koutroumbas, Konstantinos (2009). Pattern Recognition (4th ed.). Academic Press. ISBN 978-1-59749-272-0.

External links

- libsvm, LIBSVM is a popular library of SVM learners
- liblinear is a library for large linear classification including some SVMs
- SVM light is a collection of software tools for learning and classification using SVM
- SVMJS live demo is a GUI demo for a JavaScript implementation of SVMs

Statistical method of analysis which seeks to build a hierarchy of clusters
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

- Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering^([1]) are usually presented in a dendrogram.

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of $\mathcal{O}(n^3)$ and requires $\Omega(n^2)$ memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity $\mathcal{O}(n^2)$) are known: SLINK^([2]) for single-linkage and CLINK^([3]) for complete-linkage clustering. With a heap, the runtime of the general case can be reduced to $\mathcal{O}(n^2 \log n)$, an improvement on the aforementioned bound of $\mathcal{O}(n^3)$, at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.

Except for the special case of single-linkage, none of the algorithms (except exhaustive search in $\mathcal{O}(2^n)$) can be guaranteed to find the optimum solution.

Divisive clustering with an exhaustive search is $\mathcal{O}(2^n)$, but it is common to use faster heuristics to choose splits, such as k-means.

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used.
In fact, the observations themselves are not required: all that is used is a matrix of distances.

Cluster Linkage

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate distance d, such as the Euclidean distance, between single observations of the data set, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The choice of metric as well as linkage can have a major impact on the result of the clustering, where the lower level metric determines which objects are most similar, whereas the linkage criterion influences the shape of the clusters. For example, complete-linkage tends to produce more spherical clusters than single-linkage.

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

Some commonly used linkage criteria between two sets of observations A and B and a distance d are listed below; a small computational sketch follows the list.^([4][5])

- Maximum or complete-linkage clustering: $\max_{a\in A,\, b\in B} d(a,b)$
- Minimum or single-linkage clustering: $\min_{a\in A,\, b\in B} d(a,b)$
- Unweighted average linkage clustering (or UPGMA): $\frac{1}{|A|\cdot|B|}\sum_{a\in A}\sum_{b\in B} d(a,b)$
- Weighted average linkage clustering (or WPGMA): $d(i\cup j, k) = \frac{d(i,k) + d(j,k)}{2}$
- Centroid linkage clustering, or UPGMC: $\lVert \mu_A - \mu_B \rVert^2$, where $\mu_A$ and $\mu_B$ are the centroids of A resp. B
- Median linkage clustering, or WPGMC: $d(i\cup j, k) = d(m_{i\cup j}, m_k)$, where $m_{i\cup j} = \tfrac{1}{2}(m_i + m_j)$
- Versatile linkage clustering:^([6]) $\sqrt[p]{\frac{1}{|A|\cdot|B|}\sum_{a\in A}\sum_{b\in B} d(a,b)^p},\ p \neq 0$
- Ward linkage,^([7]) Minimum Increase of Sum of Squares (MISSQ):^([8]) $\frac{|A|\cdot|B|}{|A\cup B|}\lVert \mu_A - \mu_B \rVert^2 = \sum_{x\in A\cup B}\lVert x - \mu_{A\cup B}\rVert^2 - \sum_{x\in A}\lVert x - \mu_A\rVert^2 - \sum_{x\in B}\lVert x - \mu_B\rVert^2$
- Minimum Error Sum of Squares (MNSSQ):^([8]) $\sum_{x\in A\cup B}\lVert x - \mu_{A\cup B}\rVert^2$
- Minimum Increase in Variance (MIVAR):^([8]) $\frac{1}{|A\cup B|}\sum_{x\in A\cup B}\lVert x - \mu_{A\cup B}\rVert^2 - \frac{1}{|A|}\sum_{x\in A}\lVert x - \mu_A\rVert^2 - \frac{1}{|B|}\sum_{x\in B}\lVert x - \mu_B\rVert^2 = \text{Var}(A\cup B) - \text{Var}(A) - \text{Var}(B)$
- Minimum Variance (MNVAR):^([8]) $\frac{1}{|A\cup B|}\sum_{x\in A\cup B}\lVert x - \mu_{A\cup B}\rVert^2 = \text{Var}(A\cup B)$
- Mini-Max linkage:^([9]) $\min_{x\in A\cup B}\max_{y\in A\cup B} d(x,y)$
- Hausdorff linkage:^([10]) $\max_{x\in A\cup B}\min_{y\in A\cup B} d(x,y)$
- Minimum Sum Medoid linkage:^([11]) $\min_{m\in A\cup B}\sum_{y\in A\cup B} d(m,y)$, such that m is the medoid of the resulting cluster
- Minimum Sum Increase Medoid linkage:^([11]) $\min_{m\in A\cup B}\sum_{y\in A\cup B} d(m,y) - \min_{m\in A}\sum_{y\in A} d(m,y) - \min_{m\in B}\sum_{y\in B} d(m,y)$
- Medoid linkage:^([12][13]) $d(m_A, m_B)$, where $m_A$, $m_B$ are the medoids of the previous clusters
- Minimum energy clustering: $\frac{2}{nm}\sum_{i,j=1}^{n,m}\|a_i - b_j\|_2 - \frac{1}{n^2}\sum_{i,j=1}^{n}\|a_i - a_j\|_2 - \frac{1}{m^2}\sum_{i,j=1}^{m}\|b_i - b_j\|_2$
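A minimal sketch of the first three criteria for clusters given as lists of points (function names are illustrative; distances are Euclidean):

```python
import numpy as np

def complete_linkage(A, B):
    """Maximum pairwise distance between clusters A and B."""
    return max(np.linalg.norm(a - b) for a in A for b in B)

def single_linkage(A, B):
    """Minimum pairwise distance between clusters A and B."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def upgma_linkage(A, B):
    """Unweighted average of all pairwise distances (UPGMA)."""
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])

A = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
B = [np.array([3.0, 0.0]), np.array([4.0, 0.0])]
print(single_linkage(A, B), complete_linkage(A, B), upgma_linkage(A, B))
```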
Some of these can only be recomputed recursively (WPGMA, WPGMC); for many, a recursive computation with Lance–Williams equations is more efficient, while for others (Mini-Max, Hausdorff, Medoid) the distances have to be computed with the slower full formula. Other linkage criteria include:

- The probability that candidate clusters spawn from the same distribution function (V-linkage).
- The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).^([14])
- The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.^([15][16][17])

Agglomerative clustering example

[Figure: raw data]

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

The hierarchical clustering dendrogram would be:

[Figure: traditional dendrogram representation]

Cutting the tree at a given height will give a partitioning clustering at a selected precision. In this example, cutting after the second row (from the top) of the dendrogram will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and it has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters.
Usually the distance between two clusters $\mathcal{A}$ and $\mathcal{B}$ is one of the following:

- The maximum distance between elements of each cluster (also called complete-linkage clustering): $\max\{\, d(x,y) : x\in\mathcal{A},\, y\in\mathcal{B} \,\}$
- The minimum distance between elements of each cluster (also called single-linkage clustering): $\min\{\, d(x,y) : x\in\mathcal{A},\, y\in\mathcal{B} \,\}$
- The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): $\frac{1}{|\mathcal{A}|\cdot|\mathcal{B}|}\sum_{x\in\mathcal{A}}\sum_{y\in\mathcal{B}} d(x,y)$
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's method^([7]))
- The probability that candidate clusters spawn from the same distribution function (V-linkage).

In case of tied minimum distances, a pair is randomly chosen, thus being able to generate several structurally different dendrograms. Alternatively, all tied pairs may be joined at the same time, generating a unique dendrogram.^([18])

One can always decide to stop clustering when there is a sufficiently small number of clusters (number criterion). Some linkages may also guarantee that agglomeration occurs at a greater distance between clusters than the previous agglomeration, and then one can stop clustering when the clusters are too far apart to be merged (distance criterion). However, this is not the case for, e.g., the centroid linkage, where so-called reversals^([19]) (inversions, departures from ultrametricity) may occur.

Divisive clustering

The basic principle of divisive clustering was published as the DIANA (DIvisive ANAlysis clustering) algorithm.^([20]) Initially, all data is in the same cluster, and the largest cluster is split until every object is separate. Because there exist $O(2^n)$ ways of splitting each cluster, heuristics are needed. DIANA chooses the object with the maximum average dissimilarity and then moves to the new cluster all objects that are more similar to it than to the remainder.
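The agglomerative procedure illustrated above, including cutting the dendrogram at a chosen height, is available in SciPy (listed under Software below). A minimal sketch on six illustrative 2-D points standing in for the elements {a, ..., f}:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points; the coordinates are made up for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.1], [1.1, 0.2],
              [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])

Z = linkage(X, method="single")   # agglomerative, single-linkage, Euclidean
# Cut the dendrogram so that clusters are only merged below distance 2.0:
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)                     # flat cluster label for each point
```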
Software

Open source implementations

[Figure: hierarchical clustering dendrogram of the Iris dataset (using R)]
[Figure: hierarchical clustering and interactive dendrogram visualization in the Orange data mining suite]

- ALGLIB implements several hierarchical clustering algorithms (single-link, complete-link, Ward) in C++ and C# with O(n²) memory and O(n³) run time.
- ELKI includes multiple hierarchical clustering algorithms, various linkage strategies and also includes the efficient SLINK,^([2]) CLINK^([3]) and Anderberg algorithms, flexible cluster extraction from dendrograms and various other cluster analysis algorithms.
- Julia has an implementation inside the Clustering.jl package.^([21])
- Octave, the GNU analog to MATLAB, implements hierarchical clustering in the function "linkage".
- Orange, a data mining software suite, includes hierarchical clustering with interactive dendrogram visualisation.
- R has built-in functions^([22]) and packages that provide functions for hierarchical clustering.^([23][24][25])
- SciPy implements hierarchical clustering in Python, including the efficient SLINK algorithm.
- scikit-learn also implements hierarchical clustering in Python.
- Weka includes hierarchical cluster analysis.

Commercial implementations

- MATLAB includes hierarchical cluster analysis.
- SAS includes hierarchical cluster analysis in PROC CLUSTER.
- Mathematica includes a Hierarchical Clustering Package.
- NCSS includes hierarchical cluster analysis.
- SPSS includes hierarchical cluster analysis.
- Qlucore Omics Explorer includes hierarchical cluster analysis.
- Stata includes hierarchical cluster analysis.
- CrimeStat includes a nearest neighbor hierarchical cluster algorithm with a graphical output for a Geographic Information System.

See also

- Binary space partitioning
- Bounding volume hierarchy
- Brown clustering
- Cladistics
- Cluster analysis
- Computational phylogenetics
- CURE data clustering algorithm
- Dasgupta's objective
- Dendrogram
- Determining the number of clusters in a data set
- Hierarchical clustering of networks
- Locality-sensitive hashing
- Nearest neighbor search
- Nearest-neighbor chain algorithm
- Numerical taxonomy
- OPTICS algorithm
- Statistical distance
- Persistent homology

References

1. Nielsen, Frank (2016). "8. Hierarchical Clustering". Introduction to HPC with MPI for Data Science. Springer. pp. 195–211. ISBN 978-3-319-21903-5.
2. Sibson, R. (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method". The Computer Journal. British Computer Society. 16 (1): 30–34. doi:10.1093/comjnl/16.1.30.
3. Defays, D. (1977). "An efficient algorithm for a complete-link method". The Computer Journal. British Computer Society. 20 (4): 364–366. doi:10.1093/comjnl/20.4.364.
4. "The CLUSTER Procedure: Clustering Methods". SAS/STAT 9.2 Users Guide. SAS Institute. Retrieved 2009-04-26.
5. Székely, G. J.; Rizzo, M. L. (2005). "Hierarchical clustering via joint between-within distances: extending Ward's minimum variance method". Journal of Classification. 22 (2): 151–183. doi:10.1007/s00357-005-0012-9.
6. Fernández, Alberto; Gómez, Sergio (2020). "Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering". Journal of Classification. 37 (3): 584–597. arXiv:1906.09222. doi:10.1007/s00357-019-09339-z.
7. Ward, Joe H. (1963). "Hierarchical Grouping to Optimize an Objective Function". Journal of the American Statistical Association. 58 (301): 236–244. doi:10.2307/2282967. JSTOR 2282967.
(eds.), "New combinatorial clustering methods", Numerical syntaxonomy, Dordrecht: Springer Netherlands, pp. 61–77, doi:10.1007/978-94-009-2432-1_5, ISBN 978-94-009-2432-1, retrieved 2022-11-049. ^ Ao, S. I.; Yip, K.; Ng, M.; Cheung, D.; Fong, P.-Y.; Melhado, I.; Sham, P. C. (2004-12-07). "CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs". Bioinformatics. 21 (8): 1735–1736. doi:10.1093/bioinformatics/bti201. ISSN 1367-4803. PMID 15585525.10. ^ Basalto, Nicolas; Bellotti, Roberto; De Carlo, Francesco; Facchi, Paolo; Pantaleo, Ester; Pascazio, Saverio (2007-06-15). "Hausdorff clustering of financial time series". Physica A: Statistical Mechanics and Its Applications. 379 (2): 635–644. arXiv:physics/0504014. Bibcode:2007PhyA..379..635B. doi:10.1016/j.physa.2007.01.011. ISSN 0378-4371. S2CID 27093582.11. ^ ^(a) ^(b) Schubert, Erich (2021). HACAM: Hierarchical Agglomerative Clustering Around Medoids – and its Limitations (PDF). LWDA’21: Lernen, Wissen, Daten, Analysen September 01–03, 2021, Munich, Germany. pp. 191–204 – via CEUR-WS.12. ^ Miyamoto, Sadaaki; Kaizu, Yousuke; Endo, Yasunori (2016). Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures. 2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS). pp. 400–403. doi:10.1109/SCIS-ISIS.2016.0091.13. ^ Herr, Dominik; Han, Qi; Lohmann, Steffen; Ertl, Thomas (2016). Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data (PDF). Graphics Interface. Graphics Interface. doi:10.20380/gi2016.14. Retrieved 2022-11-04.14. ^ Zhang, Wei; Wang, Xiaogang; Zhao, Deli; Tang, Xiaoou (2012). Fitzgibbon, Andrew; Lazebnik, Svetlana; Perona, Pietro; Sato, Yoichi; Schmid, Cordelia (eds.). "Graph Degree Linkage: Agglomerative Clustering on a Directed Graph". Computer Vision – ECCV 2012. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 7572: 428–441. arXiv:1208.5092. Bibcode:2012arXiv1208.5092Z. doi:10.1007/978-3-642-33718-5_31. ISBN 9783642337185. S2CID 14751. See also: https://github.com/waynezhanghk/gacluster15. ^ Zhang, W.; Zhao, D.; Wang, X. (2013). "Agglomerative clustering via maximum incremental path integral". Pattern Recognition. 46 (11): 3056–65. Bibcode:2013PatRe..46.3056Z. CiteSeerX 10.1.1.719.5355. doi:10.1016/j.patcog.2013.04.013.16. ^ Zhao, D.; Tang, X. (2008). "Cyclizing clusters via zeta function of a graph". NIPS'08: Proceedings of the 21st International Conference on Neural Information Processing Systems. pp. 1953–60. CiteSeerX 10.1.1.945.1649. ISBN 9781605609492.17. ^ Ma, Y.; Derksen, H.; Hong, W.; Wright, J. (2007). "Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression". IEEE Transactions on Pattern Analysis and Machine Intelligence. 29 (9): 1546–62. doi:10.1109/TPAMI.2007.1085. hdl:2142/99597. PMID 17627043. S2CID 4591894.18. ^ Fernández, Alberto; Gómez, Sergio (2008). "Solving Non-uniqueness in Agglomerative Hierarchical Clustering Using Multidendrograms". Journal of Classification. 25 (1): 43–65. arXiv:cs/0608049. doi:10.1007/s00357-008-9004-x. S2CID 434036.19. ^ Legendre, P.; Legendre, L.F.J. (2012). "Cluster Analysis §8.6 Reversals". Numerical Ecology. Developments in Environmental Modelling. Vol. 24 (3rd ed.). Elsevier. pp. 376–7. ISBN 978-0-444-53868-0.20. ^ Kaufman, L.; Rousseeuw, P.J. (2009) [1990]. "6. Divisive Analysis (Program DIANA)". Finding Groups in Data: An Introduction to Cluster Analysis. 
20. Kaufman, L.; Rousseeuw, P.J. (2009) [1990]. "6. Divisive Analysis (Program DIANA)". Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. pp. 253–279. ISBN 978-0-470-31748-8.
21. "Hierarchical Clustering · Clustering.jl". juliastats.org. Retrieved 2022-02-28.
22. "hclust function - RDocumentation". www.rdocumentation.org. Retrieved 2022-06-07.
23. Galili, Tal; Benjamini, Yoav; Simpson, Gavin; Jefferis, Gregory (2021). dendextend: Extending 'dendrogram' Functionality in R. Retrieved 2022-06-07.
24. Paradis, Emmanuel; et al. "ape: Analyses of Phylogenetics and Evolution". Retrieved 2022-12-28.
25. Fernández, Alberto; Gómez, Sergio (2021). "mdendro: Extended Agglomerative Hierarchical Clustering". Retrieved 2022-12-28.

Further reading

- Kaufman, L.; Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (1st ed.). New York: John Wiley. ISBN 0-471-87876-6.
- Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "14.3.12 Hierarchical clustering". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 520–528. ISBN 978-0-387-84857-0.

Python may refer to:

Snakes

- Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
  - Python (genus), a genus of Pythonidae found in Africa and Asia
- Python (mythology), a mythical serpent

Computing

- Python (programming language), a widely used programming language
- Python, a native code compiler for CMU Common Lisp
- Python, the internal project name for the PERQ 3 computer workstation

People

- Python of Aenus (4th century BCE), student of Plato
- Python (painter) (c. 360–320 BCE), vase painter in Poseidonia
- Python of Byzantium, orator, diplomat of Philip II of Macedon
- Python of Catana, poet who accompanied Alexander the Great
- Python Anghelo (1954–2014), Romanian graphic artist

Roller coasters

- Python (Efteling), a roller coaster in the Netherlands
- Python (Busch Gardens Tampa Bay), a defunct roller coaster
- Python (Coney Island, Cincinnati, Ohio), a steel roller coaster

Vehicles

- Python (automobile maker), an Australian car company
- Python (Ford prototype), a Ford prototype sports car

Weaponry

- Python (missile), a series of Israeli air-to-air missiles
- Python (nuclear primary), a gas-boosted fission primary used in thermonuclear weapons
- Colt Python, a revolver

Other uses

- Python (codename), a British nuclear war contingency plan
- Python (film), a 2000 horror film by Richard Clabaugh
- Monty Python or the Pythons, a British comedy group
  - Python (Monty) Pictures, a company owned by the troupe's surviving members
- Python, a work written by philosopher Timon of Phlius

See also

- Pyton, a Norwegian magazine
- Pithon, a commune in northern France

High-level programming language
JavaScript

- Paradigm: Multi-paradigm: event-driven, functional, imperative, procedural, object-oriented programming
- Designed by: Brendan Eich of Netscape initially; others have also contributed to the ECMAScript standard
- First appeared: December 4, 1995^([1])
- Stable release: ECMAScript 2021^([2]) (June 2021)
- Preview release: ECMAScript 2022^([3]) (22 July 2021)
- Typing discipline: Dynamic, weak, duck
- Filename extensions: .js, .cjs, .mjs^([4])
- Website: www.ecma-international.org/publications-and-standards/standards/ecma-262/
- Major implementations: V8, JavaScriptCore, SpiderMonkey, Chakra
- Influenced by: Java,^([5][6]) Scheme,^([6]) Self,^([7]) AWK,^([8]) HyperTalk^([9])
- Influenced: ActionScript, AssemblyScript, CoffeeScript, Dart, Haxe, JS++, Opa, TypeScript

JavaScript (/ˈdʒɑːvəskrɪpt/), often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. As of 2022, 98% of websites use JavaScript on the client side for webpage behavior, often incorporating third-party libraries. All major web browsers have a dedicated JavaScript engine to execute the code on users' devices.

JavaScript is a high-level, often just-in-time compiled language that conforms to the ECMAScript standard.^([10]) It has dynamic typing, prototype-based object-orientation, and first-class functions. It is multi-paradigm, supporting event-driven, functional, and imperative programming styles.
It has application programming interfaces (APIs) for working with text, dates, regular expressions, standard data structures, and the Document Object Model (DOM).

The ECMAScript standard does not include any input/output (I/O), such as networking, storage, or graphics facilities. In practice, the web browser or other runtime system provides JavaScript APIs for I/O.

JavaScript engines were originally used only in web browsers, but are now core components of some servers and a variety of applications. The most popular runtime system for this usage is Node.js.

Although Java and JavaScript are similar in name, syntax, and respective standard libraries, the two languages are distinct and differ greatly in design.

History[edit]

Creation at Netscape[edit]

The first popular web browser with a graphical user interface, Mosaic, was released in 1993. Accessible to non-technical people, it played a prominent role in the rapid growth of the nascent World Wide Web.^([11]) The lead developers of Mosaic then founded the Netscape corporation, which released a more polished browser, Netscape Navigator, in 1994. This quickly became the most-used browser.^([12][13])

During these formative years of the Web, web pages could only be static, lacking the capability for dynamic behavior after the page was loaded in the browser. There was a desire in the flourishing web development scene to remove this limitation, so in 1995, Netscape decided to add a scripting language to Navigator. They pursued two routes to achieve this: collaborating with Sun Microsystems to embed the Java programming language, and hiring Brendan Eich to embed the Scheme language.^([6])

Netscape management soon decided that the best option was for Eich to devise a new language, with syntax similar to Java and less like Scheme or other extant scripting languages.^([5][6]) Although the new language and its interpreter implementation were called LiveScript when first shipped as part of a Navigator beta in September 1995, the name was changed to JavaScript for the official release in December.^([6][1][14])

The choice of the JavaScript name has caused confusion, implying that it is directly related to Java. At the time, the dot-com boom had begun and Java was the hot new language, so Eich considered the JavaScript name a marketing ploy by Netscape.^([15])

Adoption by Microsoft[edit]

Microsoft debuted Internet Explorer in 1995, leading to a browser war with Netscape. On the JavaScript front, Microsoft reverse-engineered the Navigator interpreter to create its own, called JScript.^([16])

JScript was first released in 1996, alongside initial support for CSS and extensions to HTML. Each of these implementations was noticeably different from its counterpart in Navigator.^([17][18]) These differences made it difficult for developers to make their websites work well in both browsers, leading to widespread use of "best viewed in Netscape" and "best viewed in Internet Explorer" logos for several years.^([17][19])

The rise of JScript[edit]

In November 1996, Netscape submitted JavaScript to Ecma International, as the starting point for a standard specification that all browser vendors could conform to. This led to the official release of the first ECMAScript language specification in June 1997.

The standards process continued for a few years, with the release of ECMAScript 2 in June 1998 and ECMAScript 3 in December 1999. Work on ECMAScript 4 began in 2000.^([16])

Meanwhile, Microsoft gained an increasingly dominant position in the browser market.
By the early 2000s, Internet Explorer's market share reached 95%.^([20]) This meant that JScript became the de facto standard for client-side scripting on the Web.

Microsoft initially participated in the standards process and implemented some proposals in its JScript language, but eventually it stopped collaborating on Ecma work. Thus ECMAScript 4 was mothballed.

Growth and standardization[edit]

During the period of Internet Explorer dominance in the early 2000s, client-side scripting was stagnant. This started to change in 2004, when the successor of Netscape, Mozilla, released the Firefox browser. Firefox was well received by many, taking significant market share from Internet Explorer.^([21])

In 2005, Mozilla joined Ecma International, and work started on the ECMAScript for XML (E4X) standard. This led to Mozilla working jointly with Macromedia (later acquired by Adobe Systems), who were implementing E4X in their ActionScript 3 language, which was based on an ECMAScript 4 draft. The goal became standardizing ActionScript 3 as the new ECMAScript 4. To this end, Adobe Systems released the Tamarin implementation as an open source project. However, Tamarin and ActionScript 3 were too different from established client-side scripting, and without cooperation from Microsoft, ECMAScript 4 never reached fruition.

Meanwhile, important developments were occurring in open-source communities not affiliated with Ecma work. In 2005, Jesse James Garrett released a white paper in which he coined the term Ajax and described a set of technologies, of which JavaScript was the backbone, to create web applications where data can be loaded in the background, avoiding the need for full page reloads. This sparked a renaissance period of JavaScript, spearheaded by open-source libraries and the communities that formed around them. Many new libraries were created, including jQuery, Prototype, Dojo Toolkit, and MooTools.

Google debuted its Chrome browser in 2008, with the V8 JavaScript engine that was faster than its competition.^([22][23]) The key innovation was just-in-time compilation (JIT),^([24]) so other browser vendors needed to overhaul their engines for JIT.^([25])

In July 2008, these disparate parties came together for a conference in Oslo. This led to the eventual agreement in early 2009 to combine all relevant work and drive the language forward. The result was the ECMAScript 5 standard, released in December 2009.

Reaching maturity[edit]

Ambitious work on the language continued for several years, culminating in an extensive collection of additions and refinements being formalized with the publication of ECMAScript 6 in 2015.^([26])

The creation of Node.js in 2009 by Ryan Dahl sparked a significant increase in the usage of JavaScript outside of web browsers. Node combines the V8 engine, an event loop, and I/O APIs, thereby providing a stand-alone JavaScript runtime system.^([27][28]) As of 2018, Node had been used by millions of developers,^([29]) and npm had the most modules of any package manager in the world.^([30])

The ECMAScript draft specification is currently maintained openly on GitHub, and editions are produced via regular annual snapshots.^([31]) Potential revisions to the language are vetted through a comprehensive proposal process.^([32][33]) Now, instead of edition numbers, developers check the status of upcoming features individually.^([31])

The current JavaScript ecosystem has many libraries and frameworks, established programming practices, and substantial usage of JavaScript outside of web browsers.
Plus, with the rise of single-page applications and other JavaScript-heavy websites, several transpilers have been created to aid the development process.^([34])

Trademark[edit]

"JavaScript" is a trademark of Oracle Corporation in the United States.^([35][36]) The trademark was originally issued to Sun Microsystems on 6 May 1997, and was transferred to Oracle when they acquired Sun in 2009.^([37])

Website client-side usage[edit]

JavaScript is the dominant client-side scripting language of the Web, with 98% of all websites (mid-2022) using it for this purpose.^([38]) Scripts are embedded in or included from HTML documents and interact with the DOM. All major web browsers have a built-in JavaScript engine that executes the code on the user's device.

Examples of scripted behavior[edit]

- Loading new web page content without reloading the page, via Ajax or a WebSocket. For example, users of social media can send and receive messages without leaving the current page.
- Web page animations, such as fading objects in and out, resizing, and moving them.
- Playing browser games.
- Controlling the playback of streaming media.
- Generating pop-up ads or alert boxes.
- Validating input values of a web form before the data is sent to a web server.
- Logging data about the user's behavior and sending it to a server. The website owner can use this data for analytics, ad tracking, and personalization.
- Redirecting a user to another page.
- Storing and retrieving data on the user's device, via the storage or IndexedDB standards.

Web libraries and frameworks[edit]

Over 80% of websites use a third-party JavaScript library or web framework for their client-side scripting.^([39])

jQuery is by far the most popular client-side library, used by over 75% of websites.^([39]) Facebook created the React library for its website and later released it as open source; other sites, including Twitter, now use it. Likewise, the Angular framework created by Google for its websites, including YouTube and Gmail, is now an open source project used by others.^([39])

In contrast, the term "Vanilla JS" has been coined for websites not using any libraries or frameworks, instead relying entirely on standard JavaScript functionality.^([40])

Other usage[edit]

The use of JavaScript has expanded beyond its web browser roots. JavaScript engines are now embedded in a variety of other software systems, both for server-side website deployments and non-browser applications.

Initial attempts at promoting server-side JavaScript usage were Netscape Enterprise Server and Microsoft's Internet Information Services,^([41][42]) but they were small niches.^([43]) Server-side usage eventually started to grow in the late 2000s, with the creation of Node.js and other approaches.^([43])

Electron, Cordova, React Native, and other application frameworks have been used to create many applications with behavior implemented in JavaScript. Other non-browser applications include Adobe Acrobat support for scripting PDF documents^([44]) and GNOME Shell extensions written in JavaScript.^([45])

JavaScript has recently begun to appear in some embedded systems, usually by leveraging Node.js.^([46][47][48])

Features[edit]

The following features are common to all conforming ECMAScript implementations unless explicitly specified otherwise.

Imperative and structured[edit]

JavaScript supports much of the structured programming syntax from C (e.g., if statements, while loops, switch statements, do while loops, etc.). One partial exception is scoping: originally JavaScript only had function scoping with var; block scoping was added in ECMAScript 2015 with the keywords let and const. Like C, JavaScript makes a distinction between expressions and statements. One syntactic difference from C is automatic semicolon insertion, which allows semicolons (which terminate statements) to be omitted.^([49])
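A minimal sketch of the var/let scoping difference, assuming an ES2015 engine (the names are illustrative):

    function scopes() {
      if (true) {
        var f = 1; // function-scoped: visible throughout scopes()
        let b = 2; // block-scoped: visible only inside this if-block
      }
      console.log(f);        // 1
      console.log(typeof b); // "undefined" – b is out of scope here
    }
    scopes();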
Weakly typed[edit]

JavaScript is weakly typed, which means certain types are implicitly cast depending on the operation used.^([50])

- The binary + operator casts both operands to a string unless both operands are numbers. This is because the addition operator doubles as a concatenation operator.
- The binary - operator always casts both operands to a number.
- Both unary operators (+, -) always cast the operand to a number.

Values are cast to strings like the following:^([50])

- Strings are left as-is.
- Numbers are converted to their string representation.
- Arrays have their elements cast to strings, after which they are joined by commas (,).
- Other objects are converted to the string [object Object], where Object is the name of the constructor of the object.

Values are cast to numbers by casting to strings and then casting the strings to numbers. These processes can be modified by defining toString and valueOf functions on the prototype for string and number casting respectively.

JavaScript has received criticism for the way it implements these conversions, as the complexity of the rules can be mistaken for inconsistency.^([51][50]) For example, when adding a number to a string, the number will be cast to a string before performing concatenation, but when subtracting a number from a string, the string is cast to a number before performing subtraction.

    ------------------ ---------- ------------------- ----------------------------
    left operand       operator   right operand       result
    ------------------ ---------- ------------------- ----------------------------
    [] (empty array)   +          [] (empty array)    "" (empty string)
    [] (empty array)   +          {} (empty object)   "[object Object]" (string)
    false (boolean)    +          [] (empty array)    "false" (string)
    "123" (string)     +          1 (number)          "1231" (string)
    "123" (string)     -          1 (number)          122 (number)
    "123" (string)     -          "abc" (string)      NaN (number)
    ------------------ ---------- ------------------- ----------------------------

    : JavaScript type conversions

Often also mentioned is {} + [] resulting in 0 (number). This is misleading: the {} is interpreted as an empty code block instead of an empty object, and the empty array is cast to a number by the remaining unary + operator. If the expression is wrapped in parentheses, ({} + []), the curly brackets are interpreted as an empty object and the result of the expression is "[object Object]" as expected.^([50])
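These coercions can be checked directly in any conforming engine's console; a minimal sketch reproducing the rows of the table above:

    // Each line prints the coerced result described in the table.
    console.log([] + []);       // "" – both arrays become strings, then concatenate
    console.log([] + {});       // "[object Object]"
    console.log(false + []);    // "false"
    console.log("123" + 1);     // "1231" – + concatenates when an operand is a string
    console.log("123" - 1);     // 122 – binary - always coerces to numbers
    console.log("123" - "abc"); // NaN
    console.log(({} + []));     // "[object Object]" – in expression position, {} is an empty object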
Dynamic[edit]

Typing
 JavaScript is dynamically typed like most other scripting languages. A type is associated with a value rather than an expression. For example, a variable initially bound to a number may be reassigned to a string.^([52]) JavaScript supports various ways to test the type of objects, including duck typing.^([53])

Run-time evaluation
 JavaScript includes an eval function that can execute statements provided as strings at run-time.

Object-orientation (prototype-based)[edit]

Prototypal inheritance in JavaScript is described by Douglas Crockford as:

 You make prototype objects, and then ... make new instances. Objects are mutable in JavaScript, so we can augment the new instances, giving them new fields and methods. These can then act as prototypes for even newer objects. We don't need classes to make lots of similar objects... Objects inherit from objects. What could be more object oriented than that?^([54])

In JavaScript, an object is an associative array, augmented with a prototype (see below); each key provides the name for an object property, and there are two syntactical ways to specify such a name: dot notation (obj.x = 10) and bracket notation (obj['x'] = 10). A property may be added, rebound, or deleted at run-time. Most properties of an object (and any property that belongs to an object's prototype inheritance chain) can be enumerated using a for...in loop.

Prototypes
 JavaScript uses prototypes where many other object-oriented languages use classes for inheritance.^([55]) It is possible to simulate many class-based features with prototypes in JavaScript.^([56])

Functions as object constructors
 Functions double as object constructors, along with their typical role. Prefixing a function call with new will create an instance of a prototype, inheriting properties and methods from the constructor (including properties from the Object prototype).^([57]) ECMAScript 5 offers the Object.create method, allowing explicit creation of an instance without automatically inheriting from the Object prototype (older environments can assign the prototype to null).^([58]) The constructor's prototype property determines the object used for the new object's internal prototype. New methods can be added by modifying the prototype of the function used as a constructor. JavaScript's built-in constructors, such as Array or Object, also have prototypes that can be modified. While it is possible to modify the Object prototype, it is generally considered bad practice because most objects in JavaScript will inherit methods and properties from the Object prototype, and they may not expect the prototype to be modified.^([59])

Functions as methods
 Unlike many object-oriented languages, there is no distinction between a function definition and a method definition. Rather, the distinction occurs during function calling: when a function is called as a method of an object, the function's local this keyword is bound to that object for that invocation.
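A compact sketch of these mechanics; the Point constructor and its norm method are illustrative names, not part of the language:

    // Calling Point with `new` creates an instance whose internal
    // prototype is Point.prototype.
    function Point(x, y) {
      this.x = x;
      this.y = y;
    }

    // Methods placed on the prototype are shared by every instance.
    Point.prototype.norm = function() {
      return Math.hypot(this.x, this.y);
    };

    const p = new Point(3, 4);
    p.norm();                                     // 5 – found via the prototype chain
    Object.getPrototypeOf(p) === Point.prototype; // true

    // ECMAScript 5: build an instance from a prototype directly.
    const q = Object.create(Point.prototype);
    q.x = 6; q.y = 8;
    q.norm();                                     // 10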
Functional[edit]

JavaScript functions are first-class; a function is considered to be an object.^([60]) As such, a function may have properties and methods, such as .call() and .bind().^([61]) A nested function is a function defined within another function. It is created each time the outer function is invoked. In addition, each nested function forms a lexical closure: the lexical scope of the outer function (including any constant, local variable, or argument value) becomes part of the internal state of each inner function object, even after execution of the outer function concludes.^([62]) JavaScript also supports anonymous functions.

Delegative[edit]

JavaScript supports implicit and explicit delegation.

Functions as roles (Traits and Mixins)
 JavaScript natively supports various function-based implementations of Role^([63]) patterns like Traits^([64][65]) and Mixins.^([66]) Such a function defines additional behavior by at least one method bound to the this keyword within its function body. A Role then has to be delegated explicitly via call or apply to objects that need to feature additional behavior that is not shared via the prototype chain.

Object composition and inheritance
 Whereas explicit function-based delegation does cover composition in JavaScript, implicit delegation already happens every time the prototype chain is walked in order to, e.g., find a method that might be related to but is not directly owned by an object. Once the method is found, it gets called within this object's context. Thus inheritance in JavaScript is covered by a delegation automatism that is bound to the prototype property of constructor functions.

Miscellaneous[edit]

JavaScript uses zero-based indexing.

Run-time environment
 JavaScript typically relies on a run-time environment (e.g., a web browser) to provide objects and methods by which scripts can interact with the environment (e.g., a web page DOM). These environments are single-threaded. JavaScript also relies on the run-time environment to provide the ability to include/import scripts (e.g., HTML <script> elements). This is not a language feature per se, but it is common in most JavaScript implementations. JavaScript processes messages from a queue one at a time. JavaScript calls a function associated with each new message, creating a call stack frame with the function's arguments and local variables. The call stack shrinks and grows based on the function's needs. When the call stack is empty upon function completion, JavaScript proceeds to the next message in the queue. This is called the event loop, described as "run to completion" because each message is fully processed before the next message is considered. However, the language's concurrency model describes the event loop as non-blocking: program input/output is performed using events and callback functions. This means, for instance, that JavaScript can process a mouse click while waiting for a database query to return information.^([67])
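The run-to-completion behavior can be observed with a zero-delay timer; a minimal sketch (browser or Node.js console):

    console.log("first");
    // setTimeout enqueues a message; its callback can only run after
    // the currently executing script has run to completion.
    setTimeout(() => console.log("third"), 0);
    console.log("second");
    // Prints: first, second, third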
Variadic functions
 An indefinite number of parameters can be passed to a function. The function can access them through formal parameters and also through the local arguments object. Variadic functions can also be created by using the bind method.

Array and object literals
 Like many scripting languages, arrays and objects (associative arrays in other languages) can each be created with a succinct shortcut syntax. In fact, these literals form the basis of the JSON data format.

Regular expressions
 JavaScript also supports regular expressions in a manner similar to Perl, which provide a concise and powerful syntax for text manipulation that is more sophisticated than the built-in string functions.^([68])

Promises and Async/await
 JavaScript supports promises and async/await for handling asynchronous operations. A built-in Promise object provides functionality for handling promises and associating handlers with an asynchronous action's eventual result. Recently, combinator methods were introduced in the JavaScript specification, which allow developers to combine multiple JavaScript promises and do operations based on different scenarios. The methods introduced are: Promise.race, Promise.all, Promise.allSettled and Promise.any. Async/await allows an asynchronous, non-blocking function to be structured in a way similar to an ordinary synchronous function: asynchronous code can be written, with minimal overhead, structured similarly to traditional synchronous, blocking code.
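A brief sketch of async/await together with one of the combinators; delay is an illustrative helper, not a built-in:

    // delay: a promise that resolves with `value` after `ms` milliseconds
    // (setTimeout and Promise are standard; the helper itself is hypothetical).
    const delay = (ms, value) =>
      new Promise(resolve => setTimeout(() => resolve(value), ms));

    async function main() {
      const one = await delay(100, 1); // suspends main() without blocking the event loop
      const pair = await Promise.all([delay(50, 2), delay(75, 3)]); // resolves when both do
      console.log(one, pair); // 1 [2, 3]
    }
    main();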
Vendor-specific extensions[edit]

Historically, some JavaScript engines supported these non-standard features:

- conditional catch clauses (like Java)
- array comprehensions and generator expressions (like Python)
- concise function expressions (function(args) expr; this experimental syntax predated arrow functions)
- ECMAScript for XML (E4X), an extension that adds native XML support to ECMAScript (unsupported in Firefox since version 21^([69]))

Syntax[edit]

Main article: JavaScript syntax

Simple examples[edit]

Variables in JavaScript can be defined using either the var,^([70]) let^([71]) or const^([72]) keywords. Variables defined without keywords are defined at the global scope.

    // Declares a function-scoped variable named `x`, and implicitly assigns the
    // special value `undefined` to it. Variables without value are automatically
    // set to undefined.
    // var is generally considered bad practice; let and const are usually preferred.
    var x;

    // Variables can be manually set to `undefined` like so
    let x2 = undefined;

    // Declares a block-scoped variable named `y`, and implicitly sets it to
    // `undefined`. The `let` keyword was introduced in ECMAScript 2015.
    let y;

    // Declares a block-scoped, un-reassignable variable named `z`, and sets it to
    // a string literal. The `const` keyword was also introduced in ECMAScript 2015,
    // and must be explicitly assigned to.
    // The keyword `const` means constant, hence the variable cannot be reassigned
    // as the value is `constant`.
    const z = "this value cannot be reassigned!";

    // Declares a global-scoped variable and assigns 3. This is generally considered
    // bad practice, and will not work if strict mode is on.
    t = 3;

    // Declares a variable named `myNumber`, and assigns a number literal (the value
    // `2`) to it.
    let myNumber = 2;

    // Reassigns `myNumber`, setting it to a string literal (the value `"foo"`).
    // JavaScript is a dynamically-typed language, so this is legal.
    myNumber = "foo";

Note the comments in the example above, all of which were preceded with two forward slashes.

There is no built-in input/output functionality in JavaScript; instead, it is provided by the run-time environment. The ECMAScript specification in edition 5.1 mentions that "there are no provisions in this specification for input of external data or output of computed results".^([73]) However, most runtime environments have a console object that can be used to print output.^([74]) Here is a minimalist Hello World program in JavaScript in a runtime environment with a console object:

    console.log("Hello, World!");

In HTML documents, a program must manipulate the DOM to produce output:

    // Text nodes can be made using the "write" method.
    // This is frowned upon, as it can overwrite the document if the document is fully loaded.
    document.write('foo');

    // Elements can be made too. First, they have to be created in the DOM.
    const myElem = document.createElement('span');

    // Attributes like classes and the id can be set as well
    myElem.classList.add('foo');
    myElem.id = 'bar';

    // After setting this, the tag will look like this: `<span class="foo" id="bar" data-attr="baz"></span>`
    myElem.setAttribute('data-attr', 'baz'); // Which could also be written as `myElem.dataset.attr = 'baz'`

    // Finally append it as a child element to the <body> in the HTML
    document.body.appendChild(myElem);

    // Elements can be imperatively grabbed with querySelector for one element,
    // or querySelectorAll for multiple elements that can be looped with forEach
    document.querySelector('.class');       // Selects the first element with the "class" class
    document.querySelector('#id');          // Selects the first element with an `id` of "id"
    document.querySelector('[data-other]'); // Selects the first element with the "data-other" attribute
    document.querySelectorAll('.multiple'); // Returns a NodeList of all elements with the "multiple" class

A simple recursive function to calculate the factorial of a natural number:

    function factorial(n) {
      // Checking the argument for legitimacy. Factorial is defined for non-negative integers.
      if (isNaN(n)) {
        console.error("Non-numerical argument not allowed.");
        return NaN; // The special value: Not a Number
      }
      if (n === 0)
        return 1; // 0! = 1
      if (n < 0)
        return undefined; // Factorial of negative numbers is not defined.
      if (n % 1) {
        console.warn(`${n} will be rounded to the closest integer. For non-integers consider using the gamma function instead.`);
        n = Math.round(n);
      }
      // The above checks need not be repeated in the recursion, hence defining the actual recursive part separately below.

      // The following line is a function expression to recursively compute the factorial. It uses the arrow syntax introduced in ES6.
      const recursivelyCompute = a => a > 1 ? a * recursivelyCompute(a - 1) : 1; // Note the use of the ternary operator `?`.
      return recursivelyCompute(n);
    }

    factorial(3); // Returns 6
An anonymous function (or lambda):

    const counter = function() {
      let count = 0;
      return function() {
        return ++count;
      };
    };

    const x = counter();
    x(); // Returns 1
    x(); // Returns 2
    x(); // Returns 3

This example shows that, in JavaScript, function closures capture their non-local variables by reference.

Arrow functions were first introduced in the 6th Edition, ECMAScript 2015. They shorten the syntax for writing functions in JavaScript. Arrow functions are anonymous, so a variable is needed to refer to them in order to invoke them after their creation, unless surrounded by parentheses and executed immediately.

Example of arrow function:

    // Arrow functions let us omit the `function` keyword.
    // Here `long_example` points to an anonymous function value.
    const long_example = (input1, input2) => {
      console.log("Hello, World!");
      const output = input1 + input2;
      return output;
    };

    // If there are no braces, the arrow function simply returns the expression
    // So here it's (input1 + input2)
    const short_example = (input1, input2) => input1 + input2;

    long_example(2, 3);  // Prints "Hello, World!" and returns 5
    short_example(2, 5); // Returns 7

    // If an arrow function has only one parameter, the parentheses can be removed.
    const no_parentheses = input => input + 2;
    no_parentheses(3); // Returns 5

    // An arrow function, like other function definitions, can be executed in the same statement as it is created.
    // This is useful when writing libraries to avoid filling the global scope, and for closures.
    let three = ((a, b) => a + b)(1, 2);

    // The supplied argument "seeds" the expression and is retained by a.
    const generate_multiplier_function = a => (b => isNaN(b) || !b ? a : a *= b);
    const five_multiples = generate_multiplier_function(5);
    five_multiples(1); // Returns 5
    five_multiples(3); // Returns 15
    five_multiples(4); // Returns 60

In JavaScript, objects can be created as instances of a class.

Object class example:

    class Ball {
      constructor(radius) {
        this.radius = radius;
        this.area = Math.PI * (radius ** 2);
      }

      // Classes (and thus objects) can contain functions known as methods
      show() {
        console.log(this.radius);
      }
    }

    const myBall = new Ball(5); // Creates a new instance of the ball object with radius 5
    myBall.radius++; // Object properties can usually be modified from the outside
    myBall.show();   // Using the inherited "show" function logs "6"

In JavaScript, objects can be instantiated directly from a function.

Object functional example:

    function Ball(radius) {
      const area = Math.PI * (radius ** 2);
      const obj = { radius, area };

      // Objects are mutable, and functions can be added as properties.
      obj.show = () => console.log(obj.radius);
      return obj;
    }

    const myBall = Ball(5); // Creates a new ball object with radius 5. No "new" keyword needed.
    myBall.radius++; // The instance property can be modified.
    myBall.show();   // Using the "show" function logs "6" - the new instance value.

Variadic function demonstration (arguments is a special variable):^([75])

    function sum() {
      let x = 0;
      for (let i = 0; i < arguments.length; ++i)
        x += arguments[i];
      return x;
    }

    sum(1, 2);    // Returns 3
    sum(1, 2, 3); // Returns 6

    // As of ES6, using the rest operator.
    function sum(...args) {
      return args.reduce((a, b) => a + b);
    }

    sum(1, 2);    // Returns 3
    sum(1, 2, 3); // Returns 6

Immediately-invoked function expressions are often used to create closures. Closures allow gathering properties and methods in a namespace and making some of them private:

    let counter = (function() {
      let i = 0; // Private property

      return { // Public methods
        get: function() {
          alert(i);
        },
        set: function(value) {
          i = value;
        },
        increment: function() {
          alert(++i);
        }
      };
    })(); // Module

    counter.get();       // Shows 0
    counter.set(6);
    counter.increment(); // Shows 7
    counter.increment(); // Shows 8
Generator objects (in the form of generator functions) provide a function which can be called, exited, and re-entered while maintaining internal context (statefulness).^([76])

    function* rawCounter() {
      yield 1;
      yield 2;
    }

    function* dynamicCounter() {
      let count = 0;
      while (true) {
        // An infinite loop is acceptable here, because execution suspends at each yield.
        yield ++count;
      }
    }

    // Instances
    const counter1 = rawCounter();
    const counter2 = dynamicCounter();

    // Implementation
    counter1.next(); // {value: 1, done: false}
    counter1.next(); // {value: 2, done: false}
    counter1.next(); // {value: undefined, done: true}

    counter2.next(); // {value: 1, done: false}
    counter2.next(); // {value: 2, done: false}
    counter2.next(); // {value: 3, done: false}
    // ...infinitely

JavaScript can export and import from modules:^([77])

Export example:

    /* mymodule.js */
    // This function remains private, as it is not exported
    let sum = (a, b) => {
      return a + b;
    };

    // Export variables
    export let name = 'Alice';
    export let age = 23;

    // Export named functions
    export function add(num1, num2) {
      return num1 + num2;
    }

    // Export class
    export class Multiplication {
      constructor(num1, num2) {
        this.num1 = num1;
        this.num2 = num2;
      }

      add() {
        return sum(this.num1, this.num2);
      }
    }

Import example:

    // Import one property
    import { add } from './mymodule.js';

    console.log(add(1, 2)); //> 3

    // Import multiple properties
    import { name, age } from './mymodule.js';
    console.log(name, age); //> "Alice", 23

    // Import all properties from a module under a namespace
    import * as mymodule from './mymodule.js';
    console.log(mymodule.name, mymodule.age); //> "Alice", 23
    console.log(mymodule.add(1, 2)); //> 3
More advanced example[edit]

This sample code displays various JavaScript features.

    /* Finds the lowest common multiple (LCM) of two numbers */
    function LCMCalculator(x, y) { // constructor function
      if (isNaN(x * y)) throw new TypeError("Non-numeric arguments not allowed.");
      const checkInt = function(x) { // inner function
        if (x % 1 !== 0)
          throw new TypeError(x + " is not an integer");
        return x;
      };
      this.a = checkInt(x)
      //   semicolons   ^^^^  are optional, a newline is enough
      this.b = checkInt(y);
    }
    // The prototype of object instances created by a constructor is
    // that constructor's "prototype" property.
    LCMCalculator.prototype = { // object literal
      constructor: LCMCalculator, // when reassigning a prototype, set the constructor property appropriately
      gcd: function() { // method that calculates the greatest common divisor
        // Euclidean algorithm:
        let a = Math.abs(this.a), b = Math.abs(this.b), t;
        if (a < b) {
          // swap variables
          // t = b; b = a; a = t;
          [a, b] = [b, a]; // swap using destructuring assignment (ES6)
        }
        while (b !== 0) {
          t = b;
          b = a % b;
          a = t;
        }
        // Only need to calculate GCD once, so "redefine" this method.
        // (Actually not redefinition—it's defined on the instance itself,
        // so that this.gcd refers to this "redefinition" instead of LCMCalculator.prototype.gcd.
        // Note that this leads to a wrong result if the LCMCalculator object members "a" and/or "b" are altered afterwards.)
        // Also, 'gcd' === "gcd", this['gcd'] === this.gcd
        this['gcd'] = function() {
          return a;
        };
        return a;
      },
      // Object property names can be specified by strings delimited by double (") or single (') quotes.
      "lcm": function() {
        // Variable names do not collide with object properties, e.g., |lcm| is not |this.lcm|.
        // not using |this.a * this.b| to avoid FP precision issues
        let lcm = this.a / this.gcd() * this.b;
        // Only need to calculate lcm once, so "redefine" this method.
        this.lcm = function() {
          return lcm;
        };
        return lcm;
      },
      // Methods can also be declared using ES6 syntax
      toString() {
        // Using both ES6 template literals and the (+) operator to concatenate values
        return `LCMCalculator: a = ${this.a}, b = ` + this.b;
      }
    };

    // Define generic output function; this implementation only works for Web browsers
    function output(x) {
      document.body.appendChild(document.createTextNode(x));
      document.body.appendChild(document.createElement('br'));
    }

    // Note: Array's map() and forEach() are defined in JavaScript 1.6.
    // They are used here to demonstrate JavaScript's inherent functional nature.
    [
      [25, 55],
      [21, 56],
      [22, 58],
      [28, 56]
    ].map(function(pair) { // array literal + mapping function
      return new LCMCalculator(pair[0], pair[1]);
    }).sort((a, b) => a.lcm() - b.lcm()) // sort with this comparative function; => is a shorthand form of a function, called "arrow function"
      .forEach(printResult);

    function printResult(obj) {
      output(obj + ", gcd = " + obj.gcd() + ", lcm = " + obj.lcm());
    }

The following output should be displayed in the browser window.

    LCMCalculator: a = 28, b = 56, gcd = 28, lcm = 56
    LCMCalculator: a = 21, b = 56, gcd = 7, lcm = 168
    LCMCalculator: a = 25, b = 55, gcd = 5, lcm = 275
    LCMCalculator: a = 22, b = 58, gcd = 2, lcm = 638

Security[edit]

See also: Browser security

JavaScript and the DOM provide the potential for malicious authors to deliver scripts to run on a client computer via the Web. Browser authors minimize this risk using two restrictions. First, scripts run in a sandbox in which they can only perform Web-related actions, not general-purpose programming tasks like creating files. Second, scripts are constrained by the same-origin policy: scripts from one Web site do not have access to information such as usernames, passwords, or cookies sent to another site. Most JavaScript-related security bugs are breaches of either the same-origin policy or the sandbox.

There are subsets of general JavaScript—ADsafe, Secure ECMAScript (SES)—that provide greater levels of security, especially on code created by third parties (such as advertisements).^([78][79]) Caja is another project for safe embedding and isolation of third-party JavaScript and HTML.^([80])

Content Security Policy is the main intended method of ensuring that only trusted code is executed on a Web page.

Cross-site vulnerabilities[edit]

Main articles: Cross-site scripting and Cross-site request forgery

A common JavaScript-related security problem is cross-site scripting (XSS), a violation of the same-origin policy. XSS vulnerabilities occur when an attacker can cause a target Website, such as an online banking website, to include a malicious script in the webpage presented to a victim. The script in this example can then access the banking application with the privileges of the victim, potentially disclosing secret information or transferring money without the victim's authorization. A solution to XSS vulnerabilities is to use HTML escaping whenever displaying untrusted data.
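A minimal sketch of that mitigation using standard DOM APIs (the unsafe variant is left as a comment; userInput is an illustrative name):

    const userInput = '<img src=x onerror="alert(1)">'; // attacker-controlled value

    // Unsafe: innerHTML would parse the value as HTML and run the onerror handler.
    // element.innerHTML = userInput;

    // Safer: textContent treats the value as plain text, so the markup is
    // displayed literally instead of being executed.
    const element = document.createElement('div');
    element.textContent = userInput;
    document.body.appendChild(element);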
Some browsers include partial protection against reflected XSS attacks, in which the attacker provides a URL including malicious script. However, even users of those browsers are vulnerable to other XSS attacks, such as those where the malicious code is stored in a database. Only correct design of Web applications on the server side can fully prevent XSS.

XSS vulnerabilities can also occur because of implementation mistakes by browser authors.^([81])

Another cross-site vulnerability is cross-site request forgery (CSRF). In CSRF, code on an attacker's site tricks the victim's browser into taking actions the user did not intend at a target site (like transferring money at a bank). When target sites rely solely on cookies for request authentication, requests originating from code on the attacker's site can carry the same valid login credentials of the initiating user. In general, the solution to CSRF is to require an authentication value in a hidden form field, and not only in the cookies, to authenticate any request that might have lasting effects. Checking the HTTP Referer header can also help.

"JavaScript hijacking" is a type of CSRF attack in which a <script> tag on an attacker's site exploits a page on the victim's site that returns private information such as JSON or JavaScript. Possible solutions include:

- requiring an authentication token in the POST and GET parameters for any response that returns private information.

Misplaced trust in the client[edit]

Developers of client-server applications must recognize that untrusted clients may be under the control of attackers. The application author cannot assume that their JavaScript code will run as intended (or at all) because any secret embedded in the code could be extracted by a determined adversary. Some implications are:

- Website authors cannot perfectly conceal how their JavaScript operates because the raw source code must be sent to the client. The code can be obfuscated, but obfuscation can be reverse-engineered.
- JavaScript form validation only provides convenience for users, not security. If a site verifies that the user agreed to its terms of service, or filters invalid characters out of fields that should only contain numbers, it must do so on the server, not only the client.
- Scripts can be selectively disabled, so JavaScript cannot be relied on to prevent operations such as right-clicking on an image to save it.^([82])
- It is considered very bad practice to embed sensitive information such as passwords in JavaScript because it can be extracted by an attacker.^([83])

Misplaced trust in developers[edit]

Package management systems such as npm and Bower are popular with JavaScript developers. Such systems allow a developer to easily manage their program's dependencies upon other developers' program libraries. Developers trust that the maintainers of the libraries will keep them secure and up to date, but that is not always the case. A vulnerability has emerged because of this blind trust. Relied-upon libraries can have new releases that cause bugs or vulnerabilities to appear in all programs that rely upon the libraries. Inversely, a library can go unpatched with known vulnerabilities out in the wild.
In a study of a sample of 133,000 websites, researchers found that 37% of the websites included a library with at least one known vulnerability.^([84]) "The median lag between the oldest library version used on each website and the newest available version of that library is 1,177 days in ALEXA, and development of some libraries still in active use ceased years ago."^([84]) Another possibility is that the maintainer of a library may remove the library entirely. This occurred in March 2016 when Azer Koçulu removed his repository from npm. This caused tens of thousands of programs and websites depending upon his libraries to break.^([85][86])

Browser and plugin coding errors[edit]

JavaScript provides an interface to a wide range of browser capabilities, some of which may have flaws such as buffer overflows. These flaws can allow attackers to write scripts that would run any code they wish on the user's system. This code is not by any means limited to another JavaScript application. For example, a buffer overrun exploit can allow an attacker to gain access to the operating system's API with superuser privileges.

These flaws have affected major browsers including Firefox,^([87]) Internet Explorer,^([88]) and Safari.^([89])

Plugins, such as video players, Adobe Flash, and the wide range of ActiveX controls enabled by default in Microsoft Internet Explorer, may also have flaws exploitable via JavaScript (such flaws have been exploited in the past).^([90][91])

In Windows Vista, Microsoft attempted to contain the risks of bugs such as buffer overflows by running the Internet Explorer process with limited privileges.^([92]) Google Chrome similarly confines its page renderers to their own "sandbox".

Sandbox implementation errors[edit]

Web browsers are capable of running JavaScript outside the sandbox, with the privileges necessary to, for example, create or delete files. Such privileges are not intended to be granted to code from the Web.

Incorrectly granting privileges to JavaScript from the Web has played a role in vulnerabilities in both Internet Explorer^([93]) and Firefox.^([94]) In Windows XP Service Pack 2, Microsoft demoted JScript's privileges in Internet Explorer.^([95])

Microsoft Windows allows JavaScript source files on a computer's hard drive to be launched as general-purpose, non-sandboxed programs (see: Windows Script Host). This makes JavaScript (like VBScript) a theoretically viable vector for a Trojan horse, although JavaScript Trojan horses are uncommon in practice.^([96][failed verification])

Hardware vulnerabilities[edit]

In 2015, a JavaScript-based proof-of-concept implementation of a rowhammer attack was described in a paper by security researchers.^([97][98][99][100])

In 2017, a JavaScript-based attack via browser was demonstrated that could bypass ASLR, called "ASLR⊕Cache" or AnC.^([101][102])

In 2018, the paper that announced the Spectre attacks against speculative execution in Intel and other processors included a JavaScript implementation.^([103])

Development tools[edit]

Important tools have evolved with the language.

- Every major web browser has built-in web development tools, including a JavaScript debugger.
- Static program analysis tools, such as ESLint and JSLint, scan JavaScript code for conformance to a set of standards and guidelines.
- Some browsers have built-in profilers.
  Stand-alone profiling libraries have also been created, such as benchmark.js and jsbench.^([104][105])
- Many text editors have syntax highlighting support for JavaScript code.

Related technologies[edit]

Java[edit]

A common misconception is that JavaScript is the same as Java. Both indeed have a C-like syntax (the C language being their most immediate common ancestor language). They are also typically sandboxed (when used inside a browser), and JavaScript was designed with Java's syntax and standard library in mind. In particular, all Java keywords were reserved in original JavaScript, JavaScript's standard library follows Java's naming conventions, and JavaScript's Math and Date objects are based on classes from Java 1.0.^([106])

Java and JavaScript both first appeared in 1995, but Java was developed by James Gosling of Sun Microsystems and JavaScript by Brendan Eich of Netscape Communications.

The differences between the two languages are more prominent than their similarities. Java has static typing, while JavaScript's typing is dynamic. Java is loaded from compiled bytecode, while JavaScript is loaded as human-readable source code. Java's objects are class-based, while JavaScript's are prototype-based. Finally, Java did not support functional programming until Java 8, while JavaScript has done so from the beginning, being influenced by Scheme.

JSON[edit]

JSON, or JavaScript Object Notation, is a general-purpose data interchange format that is defined as a subset of JavaScript's object literal syntax.

TypeScript[edit]

TypeScript (TS) is a strictly typed variant of JavaScript. TS differs by introducing type annotations to variables and functions, and introducing a type language to describe the types within JS. Otherwise TS shares much the same feature set as JS, allowing it to be easily transpiled to JS for running client-side, and to interoperate with other JS code.^([107])

WebAssembly[edit]

Since 2017, web browsers have supported WebAssembly, a binary format that enables a JavaScript engine to execute performance-critical portions of web page scripts close to native speed.^([108]) WebAssembly code runs in the same sandbox as regular JavaScript code.

asm.js is a subset of JavaScript that served as the forerunner of WebAssembly.^([109])

Transpilers[edit]

JavaScript is the dominant client-side language of the Web, and many websites are script-heavy. Thus transpilers have been created to convert code written in other languages, which can aid the development process.^([34])

References[edit]

1. ^ ^(a) ^(b) "Netscape and Sun announce JavaScript, the Open, Cross-platform Object Scripting Language for Enterprise Networks and the Internet" (Press release). December 4, 1995. Archived from the original on 2007-09-16.2. ^ "ECMAScript® 2021 language specification". June 2021. Retrieved 27 July 2021.3. ^ https://tc39.es/ecma262/; retrieved: 27 July 2021; publication date: 22 July 2021.4. ^ "nodejs/node-eps". GitHub. Archived from the original on 2020-08-29. Retrieved 2018-07-05.5. ^ ^(a) ^(b) Seibel, Peter (September 16, 2009). Coders at Work: Reflections on the Craft of Programming. ISBN 9781430219484. Archived from the original on December 24, 2020. Retrieved December 25, 2018. “Eich: The immediate concern at Netscape was it must look like Java.”6. ^ ^(a) ^(b) ^(c) ^(d) ^(e) "Chapter 4. How JavaScript Was Created". speakingjs.com. Archived from the original on 2020-02-27. Retrieved 2017-11-21.7. ^ "Popularity – Brendan Eich".8. ^ "Brendan Eich: An Introduction to JavaScript, JSConf 2010". YouTube. p. 22m.
Archived from the original on August 29, 2020. Retrieved November 25, 2019. “Eich: "function", eight letters, I was influenced by AWK.”9. ^ Eich, Brendan (1998). "Foreword". In Goodman, Danny (ed.). JavaScript Bible (3rd ed.). John Wiley & Sons. ISBN 0-7645-3188-3. LCCN 97078208. OCLC 38888873. OL 712205M.10. ^ "ECMAScript® 2020 Language Specification". Archived from the original on 2020-05-08. Retrieved 2020-05-08.11. ^ "Bloomberg Game Changers: Marc Andreessen". Bloomberg. Bloomberg. March 17, 2011. Archived from the original on May 16, 2012. Retrieved December 7, 2011.12. ^ Enzer, Larry (August 31, 2018). "The Evolution of the Web Browsers". Monmouth Web Developers. Archived from the original on August 31, 2018. Retrieved August 31, 2018.13. ^ Dickerson, Gordon (August 31, 2018). "Learn the History of Web Browsers". washingtonindependent.com. Retrieved August 31, 2018.14. ^ "TechVision: Innovators of the Net: Brendan Eich and JavaScript". Archived from the original on February 8, 2008.15. ^ Fin JS (June 17, 2016), Brendan Eich – CEO of Brave, archived from the original on February 10, 2019, retrieved February 7, 201816. ^ ^(a) ^(b) "Chapter 5. Standardization: ECMAScript". speakingjs.com. Archived from the original on 1 November 2021. Retrieved 1 November 2021.17. ^ ^(a) ^(b) Champeon, Steve (April 6, 2001). "JavaScript, How Did We Get Here?". oreilly.com. Archived from the original on July 19, 2016. Retrieved July 16, 2016.18. ^ "Microsoft Internet Explorer 3.0 Beta Now Available". microsoft.com. Microsoft. May 29, 1996. Archived from the original on November 24, 2020. Retrieved July 16, 2016.19. ^ McCracken, Harry (September 16, 2010). "The Unwelcome Return of "Best Viewed with Internet Explorer"". technologizer.com. Archived from the original on June 23, 2018. Retrieved July 16, 2016.20. ^ Baker, Loren (November 24, 2004). "Mozilla Firefox Internet Browser Market Share Gains to 7.4%". Search Engine Journal. Archived from the original on May 7, 2021. Retrieved May 8, 2021.21. ^ Weber, Tim (May 9, 2005). "The assault on software giant Microsoft". BBC News. Archived from the original on September 25, 2017.22. ^ "Big browser comparison test: Internet Explorer vs. Firefox, Opera, Safari and Chrome". PC Games Hardware. Computec Media AG. 3 July 2009. Archived from the original on May 2, 2012. Retrieved June 28, 2010.23. ^ Purdy, Kevin (June 11, 2009). "Lifehacker Speed Tests: Safari 4, Chrome 2". Lifehacker. Archived from the original on April 14, 2021. Retrieved May 8, 2021.24. ^ "TraceMonkey: JavaScript Lightspeed, Brendan Eich's Blog". Archived from the original on December 4, 2015. Retrieved July 22, 2020.25. ^ "Mozilla asks, 'Are we fast yet?'". Wired. Archived from the original on June 22, 2018. Retrieved January 18, 2019.26. ^ "ECMAScript 6: New Features: Overview and Comparison". es6-features.org. Archived from the original on March 18, 2018. Retrieved March 19, 2018.27. ^ Professional Node.js: Building JavaScript Based Scalable Software Archived 2017-03-24 at the Wayback Machine, John Wiley & Sons, 01-Oct-201228. ^ Sams Teach Yourself Node.js in 24 Hours Archived 2017-03-23 at the Wayback Machine, Sams Publishing, 05-Sep-201229. ^ Lawton, George (19 July 2018). "The secret history behind the success of npm and Node". TheServerSide. Archived from the original on 2 August 2021. Retrieved 2 August 2021.30. ^ Brown, Paul (13 January 2017). "State of the Union: npm". Linux.com. Archived from the original on 2 August 2021. Retrieved 2 August 2021.31. 
^ ^(a) ^(b) Branscombe, Mary (2016-05-04). "JavaScript Standard Moves to Yearly Release Schedule; Here is What's New for ES16". The New Stack. Archived from the original on 2021-01-16. Retrieved 2021-01-15.32. ^ "The TC39 Process". tc39.es. Ecma International. Archived from the original on 2021-02-07. Retrieved 2021-01-15.33. ^ "ECMAScript proposals". TC39. Archived from the original on 2020-12-04. Retrieved 2021-01-15.34. ^ ^(a) ^(b) Ashkenas, Jeremy. "List of languages that compile to JS". GitHub. Archived from the original on January 31, 2020. Retrieved February 6, 2020.35. ^ "U.S. Trademark Serial No. 75026640". uspto.gov. United States Patent and Trademark Office. 1997-05-06. Archived from the original on 2021-07-13. Retrieved 2021-05-08.36. ^ "Legal Notices". oracle.com. Oracle Corporation. Archived from the original on 2021-06-05. Retrieved 2021-05-08.37. ^ "Oracle to buy Sun in $7.4-bn deal - The Economic Times". The Economic Times.38. ^ "Usage statistics of JavaScript as client-side programming language on websites". w3techs.com. 2021-04-09. Archived from the original on 2022-02-13. Retrieved 2021-04-09.39. ^ ^(a) ^(b) ^(c) "Usage statistics of JavaScript libraries for websites". w3techs.com. Archived from the original on 2012-05-26. Retrieved 2021-04-09.40. ^ "Vanilla JS". vanilla-js.com. 2020-06-16. Archived from the original on June 16, 2020. Retrieved June 17, 2020.41. ^ "Server-Side JavaScript Guide". oracle.com. Oracle Corporation. December 11, 1998. Archived from the original on March 11, 2021. Retrieved May 8, 2021.42. ^ Clinick, Andrew (July 14, 2000). "Introducing JScript .NET". Microsoft Developer Network. Microsoft. Archived from the original on November 10, 2017. Retrieved April 10, 2018. “[S]ince the 1996 introduction of JScript version 1.0 ... we've been seeing a steady increase in the usage of JScript on the server—particularly in Active Server Pages (ASP)”43. ^ ^(a) ^(b) Mahemoff, Michael (December 17, 2009). "Server-Side JavaScript, Back with a Vengeance". readwrite.com. Archived from the original on June 17, 2016. Retrieved July 16, 2016.44. ^ "JavaScript for Acrobat". adobe.com. 2009-08-07. Archived from the original on August 7, 2009. Retrieved August 18, 2009.45. ^ treitter (2013-02-02). "Answering the question: "How do I develop an app for GNOME?"". livejournal.com. Archived from the original on 2013-02-11. Retrieved 2013-02-07.46. ^ "Tessel 2... Leverage all the libraries of Node.JS to create useful devices in minutes with Tessel". tessel.io. Archived from the original on 2021-05-26. Retrieved 2021-05-08.47. ^ "Node.js Raspberry Pi GPIO Introduction". w3schools.com. Archived from the original on 2021-08-13. Retrieved 2020-05-03.48. ^ "Espruino – JavaScript for Microcontrollers". espruino.com. Archived from the original on 2020-05-01. Retrieved 2020-05-03.49. ^ Flanagan, David (August 17, 2006). JavaScript: The Definitive Guide: The Definitive Guide. "O'Reilly Media, Inc.". p. 16. ISBN 978-0-596-55447-7. Archived from the original on August 1, 2020. Retrieved March 29, 2019.50. ^ ^(a) ^(b) ^(c) ^(d) Korolev, Mikhail (2019-03-01). "JavaScript quirks in one image from the Internet". The DEV Community. Archived from the original on October 28, 2019. Retrieved October 28, 2019.51. ^ "Wat". www.destroyallsoftware.com. 2012. Archived from the original on October 28, 2019. Retrieved October 28, 2019.52. ^ "JavaScript data types and data structures – JavaScript | MDN". Developer.mozilla.org. February 16, 2017. Archived from the original on March 14, 2017. 
Retrieved February 24, 2017.53. ^ Flanagan 2006, pp. 176–178.54. ^ Crockford, Douglas. "Prototypal Inheritance in JavaScript". Archived from the original on 13 August 2013. Retrieved 20 August 2013.55. ^ "Inheritance and the prototype chain". Mozilla Developer Network. Mozilla. Archived from the original on April 25, 2013. Retrieved April 6, 2013.56. ^ Herman, David (2013). Effective JavaScript. Addison-Wesley. p. 83. ISBN 978-0-321-81218-6.57. ^ Haverbeke, Marijn (2011). Eloquent JavaScript. No Starch Press. pp. 95–97. ISBN 978-1-59327-282-1.58. ^ Katz, Yehuda (12 August 2011). "Understanding "Prototypes" in JavaScript". Archived from the original on 5 April 2013. Retrieved April 6, 2013.59. ^ Herman, David (2013). Effective JavaScript. Addison-Wesley. pp. 125–127. ISBN 978-0-321-81218-6.60. ^ "Function – JavaScript". MDN Web Docs. Retrieved 2021-10-30.61. ^ "Properties of the Function Object". Es5.github.com. Archived from the original on January 28, 2013. Retrieved May 26, 2013.62. ^ Flanagan 2006, p. 141.63. ^ The many talents of JavaScript for generalizing Role-Oriented Programming approaches like Traits and Mixins Archived 2017-10-05 at the Wayback Machine, Peterseliger.blogpsot.de, April 11, 2014.64. ^ Traits for JavaScript Archived 2014-07-24 at the Wayback Machine, 2010.65. ^ "Home | CocktailJS". Cocktailjs.github.io. Archived from the original on February 4, 2017. Retrieved February 24, 2017.66. ^ Angus Croll, A fresh look at JavaScript Mixins Archived 2020-04-15 at the Wayback Machine, published May 31, 2011.67. ^ "Concurrency model and Event Loop". Mozilla Developer Network. Archived from the original on September 5, 2015. Retrieved August 28, 2015.68. ^ Haverbeke, Marijn (2011). Eloquent JavaScript. No Starch Press. pp. 139–149. ISBN 978-1-59327-282-1.69. ^ "E4X – Archive of obsolete content | MDN". Mozilla Developer Network. Mozilla Foundation. February 14, 2014. Archived from the original on July 24, 2014. Retrieved July 13, 2014.70. ^ "var – JavaScript – MDN". The Mozilla Developer Network. Archived from the original on December 23, 2012. Retrieved December 22, 2012.71. ^ "let". MDN web docs. Mozilla. Archived from the original on May 28, 2019. Retrieved June 27, 2018.72. ^ "const". MDN web docs. Mozilla. Archived from the original on June 28, 2018. Retrieved June 27, 2018.73. ^ "ECMAScript Language Specification – ECMA-262 Edition 5.1". Ecma International. Archived from the original on November 26, 2012. Retrieved December 22, 2012.74. ^ "console". Mozilla Developer Network. Mozilla. Archived from the original on February 28, 2013. Retrieved April 6, 2013.75. ^ "arguments". Mozilla Developer Network. Mozilla. Archived from the original on April 13, 2013. Retrieved April 6, 2013.76. ^ "function* - JavaScript | MDN". developer.mozilla.org. Retrieved 2022-09-27.77. ^ "JavaScript modules". MDN Web Docs. Mozilla. Archived from the original on 17 July 2022. Retrieved 28 July 2022.78. ^ "Making JavaScript Safe for Advertising". ADsafe. Archived from the original on 2021-07-06. Retrieved 2021-05-08.79. ^ "Secure ECMA Script (SES)". Archived from the original on May 15, 2013. Retrieved May 26, 2013.80. ^ "Google Caja Project". Google. Archived from the original on 2021-01-22. Retrieved 2021-07-09.81. ^ "Mozilla Cross-Site Scripting Vulnerability Reported and Fixed – MozillaZine Talkback". Mozillazine.org. Archived from the original on July 21, 2011. Retrieved February 24, 2017.82. ^ Kottelin, Thor (17 June 2008). "Right-click "protection"? Forget about it". blog.anta.net. 
Archived from the original on 28 July 2022. Retrieved 28 July 2022.83. ^ Rehorik, Jan (29 November 2016). "Why You Should Never Put Sensitive Data in Your JavaScript". ServiceObjects Blog. ServiceObjects. Archived from the original on June 3, 2019. Retrieved June 3, 2019.84. ^ ^(a) ^(b) Lauinger, Tobias; Chaabane, Abdelberi; Arshad, Sajjad; Robertson, William; Wilson, Christo; Kirda, Engin (December 21, 2016). Thou Shalt Not Depend on Me: Analysing the Use of Outdated JavaScript Libraries on the Web (PDF). Northeastern University. arXiv:1811.00918. doi:10.14722/ndss.2017.23414. ISBN 978-1-891562-46-4. S2CID 17885720. Archived from the original (PDF) on 29 March 2017. Retrieved 28 July 2022.85. ^ Collins, Keith (March 27, 2016). "How one programmer broke the internet by deleting a tiny piece of code". Quartz. Archived from the original on February 22, 2017. Retrieved February 22, 2017.86. ^ SC Magazine UK, Developer's 11 lines of deleted code 'breaks the internet' Archived February 23, 2017, at the Wayback Machine87. ^ Mozilla Corporation, Buffer overflow in crypto.signText() Archived 2014-06-04 at the Wayback Machine88. ^ Festa, Paul (August 19, 1998). "Buffer-overflow bug in IE". CNET. Archived from the original on December 25, 2002.89. ^ SecurityTracker.com, Apple Safari JavaScript Buffer Overflow Lets Remote Users Execute Arbitrary Code and HTTP Redirect Bug Lets Remote Users Access Files Archived 2010-02-18 at the Wayback Machine90. ^ SecurityFocus, Microsoft WebViewFolderIcon ActiveX Control Buffer Overflow Vulnerability Archived 2011-10-11 at the Wayback Machine91. ^ Fusion Authority, Macromedia Flash ActiveX Buffer Overflow Archived August 13, 2011, at the Wayback Machine92. ^ "Protected Mode in Vista IE7 – IEBlog". Blogs.msdn.com. February 9, 2006. Archived from the original on January 23, 2010. Retrieved February 24, 2017.93. ^ US CERT, Vulnerability Note VU#713878: Microsoft Internet Explorer does not properly validate source of redirected frame Archived 2009-10-30 at the Wayback Machine94. ^ Mozilla Foundation, Mozilla Foundation Security Advisory 2005–41: Privilege escalation via DOM property overrides Archived 2014-06-04 at the Wayback Machine95. ^ Andersen, Starr (2004-08-09). "Part 5: Enhanced Browsing Security". TechNet. Microsoft Docs. Changes to Functionality in Windows XP Service Pack 2. Retrieved 2021-10-20.96. ^ For one example of a rare JavaScript Trojan Horse, see Symantec Corporation, JS.Seeker.K Archived 2011-09-13 at the Wayback Machine97. ^ Gruss, Daniel; Maurice, Clémentine; Mangard, Stefan (July 24, 2015). "Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript". arXiv:1507.06955 [cs.CR].98. ^ Jean-Pharuns, Alix (July 30, 2015). "Rowhammer.js Is the Most Ingenious Hack I've Ever Seen". Motherboard. Vice. Archived from the original on January 27, 2018. Retrieved January 26, 2018.99. ^ Goodin, Dan (August 4, 2015). "DRAM 'Bitflipping' exploit for attacking PCs: Just add JavaScript". Ars Technica. Archived from the original on January 27, 2018. Retrieved January 26, 2018.100. ^ Auerbach, David (July 28, 2015). "Rowhammer security exploit: Why a new security attack is truly terrifying". slate.com. Archived from the original on July 30, 2015. Retrieved July 29, 2015.101. ^ AnC Archived 2017-03-16 at the Wayback Machine VUSec, 2017102. ^ New ASLR-busting JavaScript is about to make drive-by exploits much nastier Archived 2017-03-16 at the Wayback Machine Ars Technica, 2017103. ^ Spectre Attack Archived 2018-01-03 at the Wayback Machine Spectre Attack104. 
^ "Benchmark.js". benchmarkjs.com. Archived from the original on 2016-12-19. Retrieved 2016-11-06.105. ^ JSBEN.CH. "JSBEN.CH Performance Benchmarking Playground for JavaScript". jsben.ch. Archived from the original on 2021-02-27. Retrieved 2021-08-13.106. ^ Eich, Brendan (April 3, 2008). "Popularity". Archived from the original on July 3, 2011. Retrieved January 19, 2012.107. ^ "TypeScript: JavaScript With Syntax For Types". Typescriptlang.org. Retrieved 2022-08-12.108. ^ "Edge Browser Switches WebAssembly to 'On' -- Visual Studio Magazine". Visual Studio Magazine. Archived from the original on 2018-02-10. Retrieved 2018-02-09.109. ^ "frequently asked questions". asm.js. Archived from the original on June 4, 2014. Retrieved April 13, 2014.Further reading[edit]See also: ECMAScript Specification Documents- Flanagan, David. JavaScript: The Definitive Guide. 7th edition. Sebastopol, California: O'Reilly, 2020.- Haverbeke, Marijn. Eloquent JavaScript. 3rd edition. No Starch Press, 2018. 472 pages. ISBN 978-1593279509.(download)- Zakas, Nicholas. Principles of Object-Oriented JavaScript, 1st edition. No Starch Press, 2014. 120 pages. ISBN 978-1593275402.External links[edit]JavaScript at Wikipedia's sister projects- []Definitions from Wiktionary- []Media from Commons- []Textbooks from Wikibooks- []Resources from Wikiversity- []Documentation from MediaWikiListen to this article (48 minutes)[Spoken Wikipedia icon]This audio file was created from a revision of this article dated20 August 2013 (2013-08-20), and does not reflect subsequent edits.(Audio help · More spoken articles)- JavaScript at Curlie- "JavaScript: The First 20 Years". Retrieved 2022-02-06.+-----------------------------------+-----------------------------------+| - v | || - t | || - e | || | || JavaScript | |+-----------------------------------+-----------------------------------+| Code analysis | - ESLint || | - JSHint || | - JSLint |+-----------------------------------+-----------------------------------+| Supersets | - JS++ || | - TypeScript |+-----------------------------------+-----------------------------------+| Transpilers | - AtScript || | - Babel || | - ClojureScript || | - CoffeeScript || | - Dart || | - Elm || | - Emscripten || | - Google Closure Compiler || | - Google Web Toolkit || | - Haxe || | - LiveScript || | - Morfik || | - Nim || | - Opa || | - PureScript || | - Reason || | - WebSharper |+-----------------------------------+-----------------------------------+| Concepts | - JavaScript library || | - JavaScript syntax |+-----------------------------------+-----------------------------------+| Debuggers | - Chrome DevTools || | - Firefox Inspector || | - Komodo IDE || | - Microsoft Edge DevTools || | - Opera DevTools || | - Safari Web Inspector |+-----------------------------------+-----------------------------------+| Doc generators | - JSDoc |+-----------------------------------+-----------------------------------+| Editors (comparison) | - Ace || | - Cloud9 IDE || | - Atom || | - CodeMirror || | - Brackets || | - Light Table || | - PhpStorm || | - Orion || | - Visual Studio || | - Visual Studio Express || | - Visual Studio Code || | - Visual Studio Team Services || | - Vim |+-----------------------------------+-----------------------------------+| Engines | - List of ECMAScript engines |+-----------------------------------+-----------------------------------+| Frameworks | - Comparison of JavaScript || | frameworks || | - List of JavaScript libraries 