Principles of Linear Regression

In this article, we address the importance of linear regression, the concept behind it, and its assumptions.

Alysson Guimarães
13 min read · Feb 12, 2024

(This is a translated version of the original article, written in Brazilian Portuguese; the original is available here.)

Introduction

The two main reasons for using regression analysis are prediction and inference. In many situations, we have a set of variables but not our variable of interest, so we use the variables we have to estimate it, making a prediction. In other situations, we are interested in understanding the association between the variable of interest and the variables we possess. Here we still estimate the variable of interest, but with the aim of making inferences: answering questions about the association between the variables, their relationship, which variable affects the variable of interest the most, which does not affect it at all, and so on. To make this estimate, we need to find a function f such that Y ≈ f̂(X) for each observation (X, Y), using parametric or non-parametric approaches. The parametric approach of ordinary least squares (OLS) is one of several ways to train a linear model, and the most common.

Linear regression analysis continues to be widely used in various fields, such as the social and applied social sciences, with the aim of explaining and predicting phenomena and understanding the correlation between variables using the ordinary least squares (OLS) approach. Simple/multiple regression analysis is “a statistical technique that can be used to analyze the relationship between a single dependent variable and multiple independent (predictor) variables” (Hair et al., 2009: 176). It estimates the degree of association between the dependent and predictor variables, an association defined in terms of direction (positive or negative) and magnitude (strong or weak).

In multiple regression, it is possible to identify the contribution of each independent (explanatory) variable to the predictive capacity of the model. The functional form of the OLS model seeks to minimize the sum of the squares of the residuals from the line used to summarize the relationship between the dependent (Y) and independent (X) variables.

Accuracy vs. Interpretability

There are methods that are more flexible (e.g., boosting) and others that are less flexible (e.g., OLS), in the sense that less flexible methods have only a limited range of shapes available to estimate f. Linear regression is a relatively inflexible approach because it only generates linear functions; other methods are considerably more flexible because they can produce a much wider range of possible estimates of the function.

Trade-off between Interpretability and Flexibility

If your main goal is inference, less flexible models like linear regression are preferable because they are more interpretable. Models such as Generalized Additive Models (GAMs) are more flexible than linear regression, but they are less interpretable because the relationship between each predictor and the response is modeled using a curve.

When we are only interested in prediction, we tend to use more flexible models. They often yield better results, but there are cases where less flexible models do better; this can happen because more flexible models overfit more easily.

Simple Linear Regression

Simple linear regression is based on predicting a quantitative dependent variable (Y) from a single predictor variable (X), assuming an approximately linear relationship between X and Y. It is mathematically described as:

Y ≈ β0 + β1X

In this equation, β0 and β1 are two unknown constants, representing the intercept and slope terms of the linear model. They are also called the coefficients or parameters of the model.

After using the training data to estimate the coefficients β̂0 and β̂1 of the model, we can compute ŷ, the prediction of Y based on X = x. We use the “hat” symbol to indicate the estimated value of an unknown parameter or coefficient, or to indicate a predicted value.

Since the coefficients β0 and β1 are unknown, we need to estimate them so that we can make predictions.

Let (x1, y1), (x2, y2), …, (xn, yn) represent the n pairs of observations of the independent and dependent variables. Our goal is to estimate the coefficients β̂0 and β̂1 that best fit the regression line, so that yi ≈ β̂0 + β̂1xi for each i = 1, …, n.

There are several ways to measure how close the predicted values are to the actual ones, but the least squares method is by far the most commonly used.

Let ŷi = β̂0 + β̂1xi be the prediction of Y based on the ith value of X; then ei = yi − ŷi represents the ith residual. The residual is the difference between the actual value and the value predicted by the linear model.

The least squares method seeks the coefficients that minimize the Residual Sum of Squares (RSS), which measures the overall size of the residuals. It is defined as:

RSS = e1² + e2² + … + en² = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + … + (yn − β̂0 − β̂1xn)²

The ordinary least squares (OLS) approach seeks to minimize the residual sum of squares (RSS).
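
To make this concrete, here is a minimal sketch in Python (not from the original article) that computes the least squares estimates and the resulting RSS on synthetic data; the simulated values and variable names are assumptions made only for illustration.

import numpy as np

# Synthetic data for illustration: y depends linearly on x plus random noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2, size=100)

# Closed-form least squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals and the residual sum of squares (RSS) that OLS minimizes
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}, RSS = {rss:.3f}")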

We assume that a linear relationship holds between X and Y, Y = f(X) + ε, for an unknown function f, where epsilon (ε) is a random error term with mean zero. If f is approximated by a linear function, we can write this relationship as:

Y = β0 + β1X + ε

This model gives us the population regression line, a linear approximation of the true relationship between X and Y. β0 is the intercept term, that is, the expected value of Y when X = 0, and β1 is the slope, the average increase (or decrease) in Y associated with a one-unit increase in X. The error term captures what we cannot explain with this model and its variables alone, and it is independent of X.

With real data, the actual relationship between X and Y, or the population regression line, is not known, but the least squares line can always be calculated using estimates of the coefficients.

We measure the accuracy of the estimated coefficients through the standard error. The standard error is a statistic that measures how much a sample mean varies around the population mean, helping us assess the reliability of the calculated sample mean; we obtain an estimate of it by dividing the standard deviation by the square root of the sample size. For the coefficients, the standard error measures the precision with which the model estimates the value of the population coefficient, and the more observations we have, the smaller it is. For simple linear regression, the standard errors of the coefficients are calculated as:

SE(β̂0)² = σ²[1/n + x̄²/Σ(xi − x̄)²] and SE(β̂1)² = σ²/Σ(xi − x̄)²

where σ² = Var(ε), which in practice is estimated from the residuals.

We use the standard error to calculate confidence intervals. A 95% confidence interval is a range of values that has a 95% probability of containing the unknown population value of the parameter, meaning that if we repeatedly take samples and construct a confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter. This interval is defined by lower and upper bounds and is calculated from the sample data. For linear regression, the 95% confidence interval for β1 takes approximately the form:

β̂1 ± 2 · SE(β̂1)
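
Continuing the same kind of synthetic example (again an illustrative sketch, not the author's code), the standard error of β̂1 and the approximate 95% confidence interval can be computed directly from the formulas above:

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2, size=100)

n = len(x)
x_bar = x.mean()
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0_hat = y.mean() - beta1_hat * x_bar
residuals = y - (beta0_hat + beta1_hat * x)

# Estimated error variance: RSS / (n - 2); its square root is the RSE
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# Standard error of beta1_hat and the approximate 95% confidence interval
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))
lower, upper = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1
print(f"beta1_hat = {beta1_hat:.3f}, SE = {se_beta1:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")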

The standard error is also used for hypothesis tests on the coefficients, such as:

H0: There is no relationship between X and Y
against the alternative hypothesis
H1: There is some relationship between X and Y

Mathematically, this corresponds to testing:

H0: β1 = 0 versus H1: β1 ≠ 0

To test the null hypothesis, we need to determine whether β̂1 is sufficiently far from zero for us to be confident that β1 is not zero. How far is far enough depends on the standard error of β̂1, SE(β̂1). If SE(β̂1) is small, then even a relatively small β̂1 may provide strong evidence that β1 ≠ 0 and that there is a relationship between X and Y. If SE(β̂1) is large, then β̂1 must be large in absolute value for us to reject the null hypothesis. In practice, we calculate the t-statistic, given by:

t = (β̂1 − 0) / SE(β̂1)

which measures the number of standard deviations that β̂1 is from 0. If there really is no relationship between X and Y, then the t-statistic follows a Student's t distribution with n − 2 degrees of freedom. We then calculate the probability of observing a value equal to |t| or larger in absolute value, assuming β1 = 0. We call this probability the p-value. In short, a small p-value indicates that it would be unlikely to observe such a substantial association between the predictor and the response purely by chance, if no real association existed between them.

Therefore, if we observe a small p-value, we can infer that there is an association between the predictor and the response. Thus, we reject the null hypothesis and assert that there is a relationship between X and Y when the p-value is sufficiently small. Typical cutoff points for the p-value to reject the null hypothesis are 5% or 1%.
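
In practice these quantities are rarely computed by hand. A minimal sketch using statsmodels (the library call is standard, but the data below are invented purely for illustration) reports the estimated coefficients together with their standard errors, t-statistics, p-values, and confidence intervals:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2, size=100)

# Fit OLS with an intercept term
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

print(results.params)      # estimated coefficients (beta0_hat, beta1_hat)
print(results.bse)         # standard errors
print(results.tvalues)     # t-statistics for H0: beta = 0
print(results.pvalues)     # corresponding p-values
print(results.conf_int())  # 95% confidence intervals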

Once the null hypothesis has been rejected, we want to know how well the model fits the data, which we assess through R² and the residual standard error (RSE).

The RSE is an estimate of the standard deviation of the model's error term; roughly speaking, it is the average amount by which the response deviates from the true regression line. It therefore measures, in the units of Y, the lack of fit of the model, and the smaller its value, the closer the predictions are to the real values. The RSE is calculated as:

RSE = √(RSS / (n − 2))

The R² statistic is an alternative measure of model fit. It is the proportion of the variance explained by the model and ranges between 0 and 1. To calculate R², we use the formula:

R² = (TSS − RSS) / TSS = 1 − RSS / TSS

where TSS = Σ(yi − ȳ)² is the total sum of squares.

TSS, the total sum of squares, measures the total variance in the response Y; it can be thought of as the amount of variability inherent in the response before the regression is performed. RSS, in turn, measures the amount of variability that is left unexplained after performing the regression, so TSS − RSS gives us the amount of variability explained by the regression. The closer R² is to 1, the greater the proportion of the variability in Y that is explained by X.
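
As an illustrative sketch (synthetic data, not the article's example), RSE and R² can be computed from RSS and TSS and compared with the value reported by statsmodels:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 2, size=100)

results = sm.OLS(y, sm.add_constant(x)).fit()

n = len(y)
rss = np.sum(results.resid ** 2)      # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rse = np.sqrt(rss / (n - 2))          # residual standard error (simple regression)
r2 = 1 - rss / tss                    # proportion of variance explained
print(f"RSE = {rse:.3f}, R2 = {r2:.3f} (statsmodels reports {results.rsquared:.3f})")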

Multiple Linear Regression

Simple linear regression is a great approach for predicting a response based on a single predictor variable, but in practice we almost never have just one predictor. Thus, we need to extend simple linear regression to accommodate multiple predictors.

Just as in simple regression, in multiple regression the coefficients β0, β1, …, βp are unknown and must be estimated. The model takes the form:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

where Xj is the jth predictor and βj quantifies the association between that predictor and the response.

The parameters are estimated in the same way, using the least squares approach seen before, where we seek the coefficients that minimize the residual sum of squares.
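
For illustration only (a sketch on simulated data, not part of the original article), the least squares solution for multiple regression can be written in matrix form as β̂ = (XᵀX)⁻¹Xᵀy:

import numpy as np

# Simulated data: 200 observations, 3 predictors with known coefficients
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=n)

# Design matrix with an intercept column, then the closed-form OLS solution
X_design = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)  # estimated intercept and slopes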

To test the hypothesis that there is a relationship between the predictor variables and the response, we proceed as in simple regression, but now testing whether all the coefficients are equal to zero:

H0: β1 = β2 = · · · = βp = 0

against the alternative hypothesis

H1: at least one β is different from zero

The test is performed by computing the F-statistic:

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

When there is no relationship between the response and the predictors, we expect the F-statistic to be close to 1; the further it is above 1, the stronger the evidence that the alternative hypothesis is true.
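
A minimal sketch of this test with statsmodels, on simulated data chosen only for illustration, reports the F-statistic and its p-value and reproduces the statistic from TSS and RSS:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()

# F-statistic for H0: beta1 = beta2 = ... = betap = 0, and its p-value
print(f"F = {results.fvalue:.2f}, p-value = {results.f_pvalue:.3g}")

# The same statistic computed from TSS and RSS
rss = np.sum(results.resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
f_manual = ((tss - rss) / p) / (rss / (n - p - 1))
print(f"F (from TSS and RSS) = {f_manual:.2f}")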

Assumptions of the OLS Model

Different authors present different assumptions that need to be met for OLS regression analysis to be used properly to produce the Best Linear Unbiased Estimator (BLUE). An estimator is BLUE when it meets the properties of producing the lowest variance (Best), its relationship is linear (Linear), and its sampling distribution is unbiased (Unbiased). It is biased when it systematically overestimates or underestimates the value of the population parameter.

It is common to find a different number of assumptions in posts and articles because some of them assume that they are already met by the methodology used to understand the phenomenon before applying the model. Violating each assumption relates to a specific problem, so it is important to understand, even in a general way, the function of each of the assumptions. They are:

Implicit assumptions:

  • Measurement of variables: Poorly measured variables will produce inconsistent estimates. If they are measured with errors, the estimates of the intercept and coefficients will be biased, and tests of significance and confidence intervals will be affected. To overcome this problem of measurement errors, generalized regression models, instrumental variables, and structural equation models can be used.
  • The model must be adequately specified. All relevant independent variables should be included in the model, and no irrelevant variables should be included, as they produce inefficiency in the estimators and increase the standard error.
  • The number of observations must be greater than the number of parameters. To compute the estimates, the algorithm must invert the matrix XᵀX, and when the number of parameters exceeds the number of observations this matrix is singular, making estimation mathematically impossible (see the small sketch after this list).
  • Independence between predictor variables and residuals: The residual terms are independent and identically distributed random values. Being independent means that they are not related to predictor variables. In non-experimental research, as we cannot manipulate the value of the independent variable, all important variables should be in the model. But if there is correlation between them, the estimates will be biased.
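
To make the point about the number of observations concrete, here is a toy sketch (made-up numbers, purely illustrative): with more parameters than observations, the matrix XᵀX is rank-deficient and therefore cannot be inverted.

import numpy as np

# Toy example: 4 observations but 6 parameters (intercept + 5 predictors)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(4), rng.normal(size=(4, 5))])

xtx = X.T @ X
print(xtx.shape)                   # (6, 6)
print(np.linalg.matrix_rank(xtx))  # 4 < 6: X'X is singular, so (X'X)^-1 X'y cannot be computed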

Assumptions of Gauss-Markov:

  • Linearity: The relationship between the dependent and independent variables must be representable by a linear function. Without linearity, there is no linear regression. The further the relationship between the variables deviates from a linear function, the lower the applicability of OLS to fit the model, and the larger the difference between fitted and observed values becomes. Linearity implies that a one-unit increase in X has the same effect on Y regardless of the initial value of X. In a nonlinear relationship, even when there is an association between the variables, OLS cannot detect it. To assess linearity, we can perform the Harvey-Collier test and/or inspect a scatterplot of Y versus X (a code sketch covering this and the other diagnostic tests below appears after this list). The null hypothesis of the Harvey-Collier test is that the regression is linear; if the p-value < alpha, the regression is not linear.
  • Homoscedasticity: The variance of the residuals should be constant. When the variance of the residuals changes as Y (or the fitted values) increases, there is heteroscedasticity. Violating this assumption affects the reliability of significance tests and confidence intervals, making them incorrect, and heteroscedastic OLS models lose the property of best estimating the population parameters. One way to detect heteroscedasticity is to analyze the dispersion of the residuals versus the predicted values: the more random the dispersion, the more likely the model is homoscedastic, whereas visible patterns are a sign of heteroscedasticity. When heteroscedasticity is detected, common remedies are to increase the number of observations and to transform variables. We can also perform the Breusch-Pagan test, whose null hypothesis is that there is homoscedasticity; if the p-value < alpha, there is heteroscedasticity.
  • Absence of autocorrelation between observations: In this assumption, observations and residuals should be independent. The value of one observation should not influence the next. Being independent means there is no correlation between residuals. When violated, the reliability of significance tests and confidence intervals is affected. To detect the presence of autocorrelation, the Durbin-Watson test can be used. It ranges from 0 to 4, and the closer to zero, the higher the positive correlation, and the closer to four, the higher the negative correlation. Values between 1.5 and 2.5 suggest evidence of independence between observations.
  • Multicollinearity: Even in the presence of multicollinearity (correlation between the independent variables), the estimator remains unbiased and BLUE as long as the classic assumptions (linearity, homoscedasticity, and independence of observations) are met, but the variance of the estimated parameters increases. High levels of correlation between the independent variables make it impossible to accurately estimate the effect of each variable on the dependent variable. To detect multicollinearity, we can compute the Variance Inflation Factor (VIF) and/or visualize a (Pearson) correlation matrix of the independent variables. The VIF is calculated for each independent variable: a value equal to 1 indicates no correlation, between 1 and 5 moderate correlation, and greater than 5 high correlation.
  • Distribution of error term: The sample error should follow an approximately normal distribution, following the assumptions of the Gauss-Markov theorem, so that the Beta estimators and the sigma found from OLS are unbiased and efficient.
  • Random error term centered at zero: This assumption means that factors not included in the model (comprising the error term) do not systematically affect the mean value of Y, as positive and negative points cancel each other out. When violated, the consistency of the intercept is compromised, but the slope is not affected. The normality of the residual is not strictly necessary, as it will rarely occur in practice, being only a desirable part of this assumption.
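
The diagnostic tests mentioned above are available in statsmodels. The sketch below (synthetic, well-behaved data, assuming a reasonably recent statsmodels version) runs the Harvey-Collier, Breusch-Pagan, Durbin-Watson, and VIF checks on a fitted model:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_harvey_collier
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated, well-behaved data, so the assumptions should hold here
rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=n)

X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()

# Linearity: Harvey-Collier test (H0: the regression is linear)
hc = linear_harvey_collier(results)
print(f"Harvey-Collier p-value: {hc.pvalue:.3f}")

# Homoscedasticity: Breusch-Pagan test (H0: constant residual variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X_const)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")

# Autocorrelation: Durbin-Watson statistic (values near 2 suggest independent residuals)
print(f"Durbin-Watson: {durbin_watson(results.resid):.2f}")

# Multicollinearity: variance inflation factor for each predictor (skipping the intercept column)
for i in range(1, X_const.shape[1]):
    print(f"VIF for predictor {i}: {variance_inflation_factor(X_const, i):.2f}")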

Conclusion

This article briefly covered the concepts and assumptions of linear regression, a model often underestimated because of its simplicity. Linear regression often outperforms more complex models, as well as being more explainable, which is crucial in both business and research contexts.

Follow my profile and subscribe here to keep up with upcoming posts about other non-linear regression models.

References (ABNT Standard):

G. James et al., An Introduction to Statistical Learning, Springer Texts in Statistics, https://doi.org/10.1007/978-1-0716-1418-1_1

Figueiredo Filho, et al. O que Fazer e o que Não Fazer com a Regressão: pressupostos e aplicações do modelo linear de Mínimos Quadrados Ordinários (MQO), Revista Política Hoje, Vol. 20, n. 1, 2011.

Alysson Guimarães

Data Scientist. This account is for translated versions of my Portuguese language articles. https://k3ybladewielder.medium.com/