A term used originally to describe the fact that if, for example, parents’ and children's weights are measured, the children's weights tend to be closer to the average than are those of their parents: unusually heavy parents tend to have lighter children and unusually light parents tend to have heavier children. This phenomenon was referred to as ‘regression to the mean’ (see central tendency (measures of)).
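The pattern can be reproduced with a small simulation. The sketch below is illustrative only: it assumes a hypothetical model in which parent and child share a common family component and each adds independent noise, and the centre of 70 and scale of 10 are arbitrary choices, not data from any study.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical model (an assumption for illustration): a shared family
# component plus independent noise for each generation, arbitrary units.
pairs = []
for _ in range(20000):
    shared = random.gauss(0, 1)
    parent = 70 + 10 * (shared + random.gauss(0, 1))
    child = 70 + 10 * (shared + random.gauss(0, 1))
    pairs.append((parent, child))


def mean(values):
    return sum(values) / len(values)


overall = mean([p for p, _ in pairs])

# Families with unusually heavy or unusually light parents.
heavy = [(p, c) for p, c in pairs if p > overall + 15]
light = [(p, c) for p, c in pairs if p < overall - 15]

heavy_parent_mean = mean([p for p, _ in heavy])
heavy_child_mean = mean([c for _, c in heavy])
light_parent_mean = mean([p for p, _ in light])
light_child_mean = mean([c for _, c in light])

print(f"overall mean:  {overall:.1f}")
print(f"heavy parents: {heavy_parent_mean:.1f}, their children: {heavy_child_mean:.1f}")
print(f"light parents: {light_parent_mean:.1f}, their children: {light_child_mean:.1f}")
```

Under these assumptions the children of both groups sit between their parents' group average and the overall average, which is the sense in which they "regress" towards the mean.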
In statistical usage, regression refers in the simplest case (bivariate linear regression) to fitting a line to the plot of data from two variables, in order to represent the trend between them. Regression is asymmetric, that is, it assumes that one variable (Y, the dependent variable) is determined by the other (independent) variable X; that the relationship is linear (and hence that the variables are at the interval level of measurement); and that the fit is not perfect:
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
That is, the value of the dependent variable Y for individual i varies in a straight line with the value of X, together with an individual error term, $\varepsilon_i$. The line is described by two constants: the slope, a multiplier weight or ‘regression coefficient’, β, and the intercept, α, the point at which the regression line crosses the Y axis.
Statistically, it is assumed that the error terms ($\varepsilon_i$) are random with a mean of 0, and are independent of the values of the independent variable. The main purpose of regression analysis is to estimate the value of the slope (β), often interpreted as the overall effect of X on Y. This is normally done by using the least squares principle to find the best-fitting line, in the sense that the sum of the squared errors (the discrepancies between the actual $Y_i$ values and those predicted by the regression line) is as small as possible. The correlation coefficient (r) gives a measure of how well the data fit this regression line (perfectly if r = ±1 and as poorly as possible if r = 0).
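As a minimal sketch of the least squares calculation described above, the following computes the intercept, the slope, and the correlation coefficient from sums of squares and cross-products. The function name `fit_line` and the data values are made up for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (alpha, beta, r) for the least-squares line y = alpha + beta * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    beta = sxy / sxx                  # slope that minimises the squared errors
    alpha = mean_y - beta * mean_x    # the line passes through the point of means
    r = sxy / sqrt(sxx * syy) if syy > 0 else float("nan")
    return alpha, beta, r


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta, r = fit_line(xs, ys)
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, r = {r:.4f}")
print(f"sum of squared errors = {sum(e * e for e in residuals):.4f}")
```

The residuals from a line fitted this way sum to zero (up to rounding), which matches the assumption that the error terms have a mean of 0.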
Simple regression can be extended in various ways: to more than one independent variable (multiple linear regression) and to other functions or relationships (for example monotonic or non-metric regression for ordinal variables, used in multi-dimensional scaling, and logarithmic and power regression). In multiple linear regression, the model is written as:
$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + \varepsilon_i$$
where each regression weight $\beta_j$ (j = 1, …, k) now represents the effect of the independent variable $X_j$ on Y, controlling for (or ‘partialling out’, that is, removing the linear effect of) the other independent variables. These ‘partial regression coefficients’ or ‘beta weights’ are of especial interest in causal models and structural equation systems (see M. S. Lewis-Beck, Applied Regression: An Introduction, 1990). See also logistic (or logit) regression; multicollinearity; outlier effects.
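To illustrate what ‘controlling for’ the other independent variables means, the sketch below fits a multiple regression by solving the least-squares normal equations directly. The helper `fit_ols` and the data are hypothetical: Y is constructed as exactly 1 + 2·X1 + 3·X2, with X1 and X2 correlated, so that the simple and partial coefficients for X1 can be compared.

```python
def fit_ols(columns, ys):
    """Least-squares coefficients [alpha, beta_1, ..., beta_k].

    columns: list of k predictor lists, each the same length as ys.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    n = len(ys)
    # Each row is one individual: a leading 1 for the intercept, then X values.
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[j] * r[m] for r in rows) for m in range(p)]
         + [sum(r[j] * y for r, y in zip(rows, ys))]
         for j in range(p)]
    for c in range(p):
        # Partial pivoting: use the largest remaining entry in this column.
        pivot = max(range(c, p), key=lambda i: abs(a[i][c]))
        if abs(a[pivot][c]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        a[c], a[pivot] = a[pivot], a[c]
        pivot_value = a[c][c]
        a[c] = [v / pivot_value for v in a[c]]
        for i in range(p):
            if i != c:
                factor = a[i][c]
                a[i] = [vi - factor * vc for vi, vc in zip(a[i], a[c])]
    return [a[j][p] for j in range(p)]


# Made-up data: x1 and x2 rise together, and y depends on both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

simple = fit_ols([x1], y)        # Y regressed on X1 alone
multiple = fit_ols([x1, x2], y)  # Y regressed on X1 and X2 together

print("simple slope for X1:  ", round(simple[1], 3))
print("partial coefficients: ", [round(b, 3) for b in multiple])
```

In this constructed example the simple slope for X1 is larger than 2 because it also picks up part of the effect of the correlated X2, whereas the partial coefficient for X1 recovers the value used to build the data.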