What is regression in statistics?

Regression in statistics is a way of making predictions from data. It estimates the most likely outcome from the trend in one or more known variables (predictors), and it shows how that outcome shifts when one of the predictors changes.

For example, we might predict the most likely grade point average (GPA) students would earn in college based on their annual aptitude scores in high school.

More technically, regression estimates the relationship between one or more independent variables (the predictors, which we can vary) and one dependent variable (the predicted outcome), including how strong that relationship is.

Continuing with the example above, regression allows us to estimate how much higher students’ college GPA might be if their aptitude scores were improved by 2 points every year.
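To make the GPA example concrete, here is a minimal sketch of a one-predictor regression fitted by ordinary least squares. The scores and GPAs are invented for illustration, the names fit_line and predict_gpa are hypothetical, and the sketch looks at a single 2-point improvement rather than a year-by-year one.

```python
# Hypothetical data: high-school aptitude scores and the college GPAs
# the same students later earned. These numbers are made up.
aptitude = [20, 22, 24, 26, 28, 30]
gpa = [2.5, 2.7, 3.0, 3.1, 3.4, 3.6]


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(aptitude, gpa)


def predict_gpa(score):
    """Predicted GPA for a given aptitude score, using the fitted line."""
    return intercept + slope * score


# Estimated effect of raising an aptitude score by 2 points.
gain = predict_gpa(27) - predict_gpa(25)
print(f"Predicted GPA gain per 2 aptitude points: {gain:.2f}")
```

With a straight-line fit, the estimated gain from a 2-point improvement is simply twice the slope, whatever the starting score.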

Regression example

Let’s look at another regression example. Suppose you want to predict how much money you could make by investing in mutual funds over the next 10 years. The known data points (predictors or variables you can change) in this example would be how much money you contribute, how frequently you contribute, and the past performance of the mutual funds. By adjusting any one of those variables, you can predict how your return on investment may increase or decrease.

Regression equation and regression line

Regression equation and regression line are two important terms to know. A regression line is the line that best fits the trend in our known data when we plot it on a graph. That line is described by a regression equation: a mathematical formula where we can plug in our known data (predictors) to calculate the predicted outcome. There are different types of regression equations, but the most common one is the linear regression equation.

By using a regression equation to visualize our data on a graph, we can more easily see how the outcome might change when one or more predictors change.
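As an illustration, the linear regression equation with a single predictor is usually written as below. The symbols are the conventional ones and do not come from this article: x is the predictor, \hat{y} is the predicted outcome, b_0 is the intercept, b_1 is the slope, and \bar{x} and \bar{y} are the averages of the observed data. The second and third lines give the standard least-squares estimates of the slope and intercept.

```latex
\hat{y} = b_0 + b_1 x

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

b_0 = \bar{y} - b_1 \bar{x}
```

Plotting \hat{y} against x gives the regression line, and the slope b_1 is the predicted change in the outcome for each one-unit change in the predictor.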

Types of regression

Different types of data call for different types of regression. (Defining those data types in the context of regression is beyond the scope of this explanation.)

The types of regression include linear, logistic, ridge, LASSO, polynomial, Bayesian, and others. At a very high level, several of them differ in the shape of the regression line: linear regression fits a straight line, polynomial regression fits a curve, and logistic regression fits an S-shaped curve, reflecting the varied patterns of the data being visualized. Others, such as ridge and LASSO, differ mainly in how the equation is fitted to the data rather than in the shape of the line.

(It’s worth noting that each type of regression has its own equation.)
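For a rough sense of how those equations differ, here are the standard textbook forms of three of the types named above, in order: linear with k predictors, polynomial of degree d, and logistic with one predictor. The notation is an assumption made for illustration: x denotes a predictor, the b terms are fitted coefficients, \hat{y} is the predicted outcome, and \hat{p} is a predicted probability.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k

\hat{y} = b_0 + b_1 x + b_2 x^2 + \dots + b_d x^d

\hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
```

The first describes a straight line (or a flat plane when there are several predictors), the second a curve, and the third an S-shaped curve that stays between 0 and 1.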

What is the regression fallacy?

Regression toward the mean is the tendency for an unusually good or bad result to be followed by one closer to the average (i.e. it regresses toward the mean). The regression fallacy is the mistake of crediting that natural drift to some specific cause. It can arise anywhere random chance plays a part in the outcome.

For example, success in business is often a combination of both skill and luck. This means that the best-performing companies today are likely to be much closer to average in 10 years’ time, not through incompetence but because today they’re likely benefiting from a string of good luck – like rolling a double-six repeatedly.
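The skill-plus-luck idea can be checked with a small simulation. This is a sketch under invented assumptions: every company gets a fixed skill level and fresh, independent luck in each period, both drawn from the same bell curve, and the company count, seed, and names are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

N_COMPANIES = 1000
TOP_N = 100  # the "best performers" are the top 10% in period 1

# Assumption: skill is fixed per company, luck is redrawn every period.
skill = [random.gauss(0, 1) for _ in range(N_COMPANIES)]


def performance(skill_level):
    """One period's result: lasting skill plus one-off luck."""
    return skill_level + random.gauss(0, 1)


period_1 = [performance(s) for s in skill]
period_2 = [performance(s) for s in skill]

# Pick the top performers using period 1 results only.
ranked = sorted(range(N_COMPANIES), key=lambda i: period_1[i], reverse=True)
top = ranked[:TOP_N]

top_mean_1 = sum(period_1[i] for i in top) / TOP_N
top_mean_2 = sum(period_2[i] for i in top) / TOP_N
overall_mean_2 = sum(period_2) / N_COMPANIES

print(f"Top group, period 1: {top_mean_1:.2f}")
print(f"Same group, period 2: {top_mean_2:.2f}")
print(f"All companies, period 2: {overall_mean_2:.2f}")
```

The same group scores lower in the second period even though nobody's skill changed, yet it still beats the overall average because part of its earlier success was real skill. Blaming the drop on incompetence would be the regression fallacy.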

Regression vs. correlation

Correlation measures whether, and how strongly, two variables move together.

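Regression goes a step further: it treats one variable (the outcome) as depending on one or more other variables and fits an equation to that relationship. This allows us to ‘play’ with the outcome by changing the independent variables. Like correlation, though, a regression on its own does not prove that the predictors cause the outcome.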

For example, we could see how fluctuating oil costs would impact gasoline prices.
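The difference can be seen side by side in a short sketch. The oil and gasoline prices below are invented for illustration, and the function names are hypothetical. Correlation gives one number describing how closely the two move together, while regression gives an equation that can be used to predict.

```python
import math

# Hypothetical prices, made up for this sketch.
oil = [60, 70, 80, 90, 100]        # US dollars per barrel
gas = [2.6, 2.9, 3.1, 3.5, 3.7]    # US dollars per gallon

n = len(oil)
mean_oil = sum(oil) / n
mean_gas = sum(gas) / n

# Sums of squares and cross-products around the means.
sxy = sum((x - mean_oil) * (y - mean_gas) for x, y in zip(oil, gas))
sxx = sum((x - mean_oil) ** 2 for x in oil)
syy = sum((y - mean_gas) ** 2 for y in gas)

# Correlation: a single number between -1 and 1.
r = sxy / math.sqrt(sxx * syy)

# Regression: a fitted line we can use for "what if" questions.
slope = sxy / sxx
intercept = mean_gas - slope * mean_oil


def predict_gas(oil_price):
    """Predicted gasoline price for a given oil price."""
    return intercept + slope * oil_price


print(f"Correlation between oil and gasoline: {r:.3f}")
print(f"Predicted gasoline price with oil at 110: {predict_gas(110):.2f}")
```

Neither number shows on its own that oil prices cause gasoline prices to move. That conclusion needs knowledge from outside the data.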

Additional resources to learn more about regressions