Defining the Regression Statistical Model

Regression analyzes relationships between variables


Regression is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset. For example, regression might be used to predict the cost of a product or service, given other variables.

Regression is used across multiple industries for business and marketing planning, financial forecasting, environmental modeling and analysis of trends.

Regression vs. Classification

Regression and classification are data mining techniques used to solve similar problems, so they are frequently confused. Both are used in prediction analysis, but regression predicts a numeric or continuous value, while classification assigns data to discrete categories.

For example, regression would be used to predict a home's value based on its location, square footage, price when last sold, the price of similar homes, and other factors. Classification would be in order if you instead wanted to sort houses into categories, such as high or low walkability, small or large lot size, or high or low crime rate.
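The difference in output type can be sketched in a few lines of Python. This is only an illustration: the function names, the dollar figures and the lot-size cutoffs are hypothetical, hand-picked values, not numbers fitted to real housing data.

```python
# Hypothetical, hand-picked numbers purely for illustration; nothing here is fitted to data.

def predict_price(square_feet, price_per_sq_ft=150.0, base=50_000.0):
    """Regression-style output: a continuous dollar value."""
    return base + price_per_sq_ft * square_feet


def classify_lot(lot_size_acres):
    """Classification-style output: one label from a fixed set of categories."""
    if lot_size_acres < 0.25:
        return "small"
    if lot_size_acres < 1.0:
        return "medium"
    return "large"


price = predict_price(2000)   # a number on a continuous scale
label = classify_lot(0.5)     # one of "small", "medium", "large"
print(price, label)
```

The first function can return any value along a scale, while the second can only ever return one of three labels.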

Types of Regression Techniques

The simplest and oldest form of regression is linear regression, which is used to estimate a relationship between two variables. This technique uses the mathematical formula of a straight line (y = mx + b). In plain terms, this means that, given a graph with an X-axis and a Y-axis, the data points fall close to a straight line, with few outliers. For example, we might assume that, given an increase in population, food production would increase at a constant rate; this requires a strong, linear relationship between the two figures. To visualize this, consider a graph in which the X-axis tracks population (the predictor) and the Y-axis tracks food production (the value being predicted). Each increase in X brings a proportional increase in Y, fixed by the slope m, making the relationship between them a straight line.
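As a minimal sketch of how the slope m and intercept b in y = mx + b can be estimated, the following Python uses ordinary least squares, a standard fitting method that the article itself does not name. It treats population as the predictor x and food production as the predicted value y. The data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: population (millions) and food production (million tonnes).
population = [1.0, 2.0, 3.0, 4.0, 5.0]
food = [2.1, 3.9, 6.2, 7.8, 10.0]

m, b = fit_line(population, food)
predicted = m * 6.0 + b  # estimate food production for a population of 6 million
print(f"slope={m:.3f}, intercept={b:.3f}, prediction={predicted:.2f}")
```

Once m and b are known, a prediction for any new x is a single multiplication and addition.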

Advanced techniques, such as multiple regression, predict a relationship between multiple variables — for example, is there a correlation between income, education and where one chooses to live? The addition of more variables considerably increases the complexity of the prediction. There are several types of multiple regression techniques including standard, hierarchical, setwise and stepwise, each with its own application.

At this point, it's important to understand what we are trying to predict (the dependent or predicted variable) and the data we are using to make the prediction (the independent or predictor variables). In our example, we want to predict the location where one chooses to live (the predicted variable) given income and education (both predictor variables).

  • Standard multiple regression considers all predictor variables at the same time. It answers questions such as: 1) what is the relationship between income and education (predictors) and choice of neighborhood (predicted); and 2) to what degree does each individual predictor contribute to that relationship?
  • Stepwise multiple regression answers an entirely different question. A stepwise regression algorithm analyzes which predictors are best used to predict the choice of neighborhood, meaning that the stepwise model evaluates the order of importance of the predictor variables and then selects a relevant subset. This type of regression builds the regression equation in "steps," adding or removing one predictor at a time. As a result, not every predictor necessarily appears in the final regression equation.
  • Hierarchical regression, like stepwise, is a sequential process, but the predictor variables are entered into the model in an order specified in advance by the analyst; that is, the algorithm has no built-in rule for determining the order in which to enter the predictors. This is used most often when the individual creating the regression equation has expert knowledge of the field.
  • Setwise regression is also similar to stepwise but analyzes sets of variables rather than individual variables.
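The standard, stepwise and hierarchical approaches described above can be sketched in plain Python. This is a simplified illustration under several assumptions. Because choice of neighborhood is a category, a hypothetical numeric stand-in, `cost_index`, is used as the predicted variable. The data are invented and built from a known formula. The stepwise rule here simply stops when the gain in R² (the share of variation explained) falls below a threshold, whereas statistical packages typically rely on significance tests. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def sse(columns, y):
    """Sum of squared residuals left after fitting y on the given columns."""
    coef = fit(columns, y)
    preds = [coef[0] + sum(c * col[i] for c, col in zip(coef[1:], columns))
             for i in range(len(y))]
    return sum((yi - p) ** 2 for yi, p in zip(y, preds))


def r_squared(columns, y):
    """Share of the variation in y explained by the given columns."""
    return 1.0 - sse(columns, y) / sse([], y)


def forward_stepwise(predictors, y, min_gain=0.01):
    """Add the predictor that raises R^2 most; stop when the gain falls below min_gain."""
    chosen, best = [], 0.0
    while len(chosen) < len(predictors):
        scores = {name: r_squared([predictors[n] for n in chosen + [name]], y)
                  for name in predictors if name not in chosen}
        name = max(scores, key=scores.get)
        if scores[name] - best < min_gain:
            break
        chosen.append(name)
        best = scores[name]
    return chosen


def hierarchical(predictors, y, order):
    """Enter predictors in a fixed, analyst-chosen order; report R^2 after each entry."""
    return [(name, r_squared([predictors[n] for n in order[:i + 1]], y))
            for i, name in enumerate(order)]


# Hypothetical data: income (thousands), education (years), and an unrelated column.
predictors = {
    "income": [30, 45, 60, 75, 90, 50],
    "education": [12, 16, 12, 18, 16, 14],
    "shoe_size": [9, 7, 10, 8, 11, 8],
}
# Constructed as 10 + 2*income + 5*education so the true coefficients are known.
cost_index = [130, 180, 190, 250, 270, 180]

standard = fit([predictors["income"], predictors["education"]], cost_index)
selected = forward_stepwise(predictors, cost_index)
steps = hierarchical(predictors, cost_index, ["education", "income"])
print(standard, selected, steps)
```

In this sketch, `standard` fits income and education together, `selected` lists the predictors that the stepwise rule kept (the unrelated column adds almost nothing and is left out), and `steps` shows how R² grows when the analyst enters education first and income second.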


