


Residuals and Correlation

Residuals and correlation are two important and widely used statistical concepts related to linear regression. Linear regression is a method of analysis used to predict a dependent variable based on the value of an independent variable.

The regression line is the straight line that best summarizes the data in a scatter plot.

A regression line is an estimate of the true but unknown linear relationship between the two variables. When the value of the explanatory variable is known, the regression line’s equation is used to predict the value of the response variable.

The correlation coefficient and the residuals are two of the most important technical measures related to the regression line of a data set.

What are residuals?

A residual is the difference between the observed value and the value the model predicts for that observation. Every observation has a residual with respect to the regression line; in statistical modeling, the fitted regression model is used to calculate the residuals.

The residual is the vertical distance from an observation to the regression line. Observations above the line have positive residuals and, similarly, observations below the line have negative residuals. We can say that each data value is the sum of the fit and the residual (Data = Fit + Residual).

Regression residual


The residual is the difference between a data point’s actual value and the value the regression line predicts for that same point.

Therefore, Residual = actual - predicted.

The predicted value from the regression line is denoted \hat y, and the actual value is the observed y for that point.

Hence, the formula to calculate the residual for an observation is:

    \[e = y - \hat y \]

The sum of the residuals is always 0, i.e., \sum e  = 0.

And the mean of the residuals is also 0, i.e., \bar e  = 0.

  • Calculating the residual for an observation

1. Compute the predicted value \hat y for the observation.

2. Calculate the difference using the formula.
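The two steps above can be sketched in a few lines of Python; the slope and intercept below are made-up values used purely for illustration:

```python
# Residual for one observation: e = y - y_hat.
# The fitted line y_hat = 0.5*x + 2 is a hypothetical example.
def predicted(x, slope=0.5, intercept=2.0):
    """Step 1: the value the regression line predicts at x."""
    return slope * x + intercept

def residual(x, y, slope=0.5, intercept=2.0):
    """Step 2: observed minus predicted (positive above the line)."""
    return y - predicted(x, slope, intercept)

print(residual(4.0, 5.0))  # observation (4, 5): 5 - (0.5*4 + 2) = 1.0
```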

Residual plots

Residual plots are a visual representation used to validate regression models.

The residual plots have the independent variables on the x-axis and the calculated residual values on the y-axis.


Analysis of the residual plots

We can analyze the residual plots and if they show characteristics of a good residual plot, we can validate the linear regression model of the same data.

A good residual plot has the following characteristics:

  • The residuals are independent and normally distributed

If we do not see any pattern as we move along the x-axis, the residuals are independent.

And if we project the residual values onto the y-axis, they should follow a normally distributed curve.

  • It has a low density of points far from the origin compared to a high density of points nearby.
  • It is symmetric about the origin.
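A residual plot like the one described above can be produced with a short script. This is a minimal sketch assuming numpy and matplotlib are available; the data values are made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Illustrative data (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Fit y = b1*x + b0 by least squares, then compute e = y - y_hat.
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b1 * x + b0)

# Residual plot: independent variable on the x-axis,
# calculated residuals on the y-axis.
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.savefig("residual_plot.png")
```

A plot with no pattern and points scattered symmetrically about the zero line supports the linear model.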


What is correlation?

Correlation is a statistical measure that describes how closely two variables are linearly related. It is a common way to express a simple association between variables without implying cause and effect.

There are four types of correlation:

  • Positive correlation: A positive linear correlation is when both variables, i.e., the values on the x-axis and the y-axis, increase together, so the regression line has an upward slope.
  • Negative correlation: A negative linear correlation is when one variable decreases as the other increases, so the regression line has a downward slope.
  • Non-linear correlation: There is a defined relationship between the two variables, but the relationship is not linear.
  • No correlation: The two variables show no visible or meaningful relationship or pattern.

Calculating the correlation using Pearson’s correlation coefficient

Statistics defines Pearson’s correlation coefficient, often written simply as Pearson’s r, as a measure of the strength of the linear association between two variables.

Pearson’s correlation coefficient formula:

    \[r = \frac{{N\sum {xy}  - \left( {\sum x } \right)\left( {\sum y } \right)}}{{\sqrt {\left[ {N\sum {{x^2}}  - {{\left( {\sum x } \right)}^2}} \right]\left[ {N\sum {{y^2}}  - {{\left( {\sum y } \right)}^2}} \right]} }}\]


N = the number of pairs of scores.

\sum {xy} = the sum of the products of paired scores.

\sum x = the sum of x scores.

\sum y = the sum of y scores.

\sum {{x^2}} = the sum of squared x scores.

\sum {{y^2}} = the sum of squared y scores.
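The raw-score formula above translates directly into code. This is a minimal sketch; the helper name pearson_r is our own:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw-score sums in the formula above."""
    n = len(xs)                                   # N pairs of scores
    sum_x, sum_y = sum(xs), sum(ys)               # sum of x, sum of y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of paired products
    sum_x2 = sum(x * x for x in xs)               # sum of squared x
    sum_y2 = sum(y * y for y in ys)               # sum of squared y
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear data: 1.0
```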

Residuals and correlation in regression analysis

The correlation coefficient is used to find how strong the relationship is between the variables x and y.

The formula used to find the correlation coefficient r in linear regression is:


    \[r = \frac{1}{{n - 1}}\sum {\left( {\frac{{{x_i} - \bar x}}{{{s_x}}}} \right)\left( {\frac{{{y_i} - \bar y}}{{{s_y}}}} \right)} \]

\bar x is the mean of x.

\bar y is the mean of y.

{s_x} is the standard deviation of x.

{s_y} is the standard deviation of y.

n is the number of observations.

The correlation coefficient always lies between -1 and 1. Values close to 0 indicate a weak relationship, while values between 0.5 and 1, or between -0.5 and -1, indicate a strong relationship between the variables.
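The standardized-score form of r can be checked numerically. This sketch (the helper name pearson_r_z is our own) uses sample standard deviations, dividing by n - 1 to match the 1/(n - 1) factor in the formula:

```python
from math import sqrt

def pearson_r_z(xs, ys):
    """r = 1/(n-1) * sum of z_x * z_y, with sample standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n                     # means
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))   # sample sd of x
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))   # sample sd of y
    # Sum of products of standardized scores, divided by n - 1.
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

print(pearson_r_z([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear data
```

Because every value is standardized, the result is unit-free and always lies in [-1, 1].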

  • Residuals in Nonlinear Regression

The residuals are calculated the same way as in linear regression, i.e., as the vertical difference between the actual and the predicted values, e = y - \hat y.

  • Residuals in Time Series Analysis

The residuals are the differences between the observed values {y_t} and the fitted values {\hat y_t}, defined as:

    \[{e_t} = {y_t} - {\hat y_t} = {y_t} - {\hat \beta _0} - {\hat \beta _1}{x_{1,t}} - {\hat \beta _2}{x_{2,t}} -  \cdots  - {\hat \beta _k}{x_{k,t}}\]

We also have two important properties:

1. \sum\limits_{t = 1}^T {{e_t}}  = 0

2. \sum\limits_{t = 1}^T {{x_{k,t}}{e_t}}  = 0

The second property holds for every predictor k.
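Both properties can be verified numerically for an ordinary least squares fit that includes an intercept. This sketch uses numpy with made-up regressors and a made-up response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative regressors x1, x2 and a noisy response (all made up).
T = 50
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
y = 1.5 + 2.0 * x1 - 0.7 * x2 + rng.normal(scale=0.3, size=T)

# Design matrix with an intercept column; least-squares fit.
X = np.column_stack([np.ones(T), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta  # residuals e_t = y_t - y_hat_t

# Both sums vanish (up to floating-point error): sum(e) and sum(x_k * e).
print(abs(e.sum()), abs((x1 * e).sum()), abs((x2 * e).sum()))
```

The first property depends on the intercept column being present; without it, the residuals need not sum to zero.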

Common pitfalls in interpreting residuals and correlation

  • Correlation should be used only when the relationship is linear, since the correlation coefficient looks for a linear relationship. It can therefore be misleading when two variables are related but the relationship is nonlinear.
  • Correlation analysis assumes all observations are independent of one another. It should not be applied when the data contain several observations of the same individual.
  • Only data on a continuous scale are appropriate for linear correlation analysis. It should not be used when one or both variables have been measured on an ordinal scale.
  • Correlation should not be computed between a variable and one of its own elements (components), since the shared component inflates the apparent relationship.
  • We have to avoid multicollinearity, the relationship that develops between two or more predictor variables. This is undesirable because it adds redundancy to the model: the coefficient matrix X cannot reach full rank in this situation, which affects the least squares optimization.
  • Another common pitfall is heteroscedasticity, i.e., non-constant variance of the error terms. To check for it, we can plot the residuals (often in standardized form) against the predicted values.

Applications of residuals and correlation

  • Business leaders can make more meaningful predictions from trends in data with the help of correlation and regression analysis. This approach can enhance company operations and appropriately guide management, customer experience initiatives, and performance.
  • Correlation and residuals can be used for any relationship between activities and their effects, such as height and weight, time spent exercising and body fat, substance abuse and intelligence, or temperature and the amount of electricity used.


In this article, we learned how residuals and correlation are central to the analysis of linear regression and how we can use them statistically to relate predicted values to observed values. We learned about the different types of correlation, Pearson’s correlation coefficient, and how correlation is used in linear regression. These are important tools for estimating values in many real-life analyses.

Sample examples

Example 1: Find the correlation coefficient for the given data set.

x: 2, 4, 6, 8, 10
y: 0.6, 1.0, 0.4, 0.4, 2.0
Solution 1:

We need to find the means of x and y.

    \[\bar x  = \frac{{2 + 4 + 6 + 8 + 10}}{5} = 6\]

    \[\bar y  = \frac{{0.6 + 1.0 + 0.4 + 0.4 + 2.0}}{5} = 0.88\]

Now we find the sample standard deviations {s_x} and {s_y}, dividing by n - 1 = 4 to match the \frac{1}{{n - 1}} factor in the formula for r.

    \[{s_x} = \sqrt {\frac{{\sum\limits_{i = 1}^5 {{{\left( {{x_i} - \bar x} \right)}^2}} }}{4}}  = \sqrt {\frac{{16 + 4 + 0 + 4 + 16}}{4}}  = \sqrt {10}  \approx 3.162\]

    \[{s_y} = \sqrt {\frac{{\sum\limits_{i = 1}^5 {{{\left( {{y_i} - \bar y} \right)}^2}} }}{4}}  = \sqrt {\frac{{0.0784 + 0.0144 + 0.2304 + 0.2304 + 1.2544}}{4}}  = \sqrt {0.452}  \approx 0.672\]

Now we use the formula

    \[r = \frac{1}{{n - 1}}\sum {\left( {\frac{{{x_i} - \bar x}}{{{s_x}}}} \right)\left( {\frac{{{y_i} - \bar y}}{{{s_y}}}} \right)} \]

to calculate the correlation coefficient and substitute the values calculated.

    \[\begin{array}{l}r = \frac{1}{4}\left( {\frac{1}{{{s_x}{s_y}}}} \right)\left[ {\left( { - 4} \right)\left( { - 0.28} \right) + \left( { - 2} \right)\left( {0.12} \right) + \left( 0 \right)\left( { - 0.48} \right) + \left( 2 \right)\left( { - 0.48} \right) + \left( 4 \right)\left( {1.12} \right)} \right]\\ = \frac{1}{4} \cdot \frac{{1.12 - 0.24 + 0 - 0.96 + 4.48}}{{2.126}}\\ = \frac{{4.4}}{{8.504}}\\ \approx 0.517\end{array}\]

The correlation coefficient is approximately 0.517.
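As a quick numerical check (assuming numpy is available), np.corrcoef computes Pearson's r directly; it uses the sample-standard-deviation convention, consistent with the \frac{1}{n-1} formula:

```python
import numpy as np

# Data from Example 1.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.6, 1.0, 0.4, 0.4, 2.0])

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.517
```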

Example 2: Find the residuals for the data given and make a scatterplot with the residual values. (Residual plot)


Solution 2:

We use the formula

    \[e = y - \hat y \]


Therefore, the residuals for each observation would be


Plotting the calculated residuals against x gives the residual plot.

Example 3: For a linear fit \hat y  = 0.45x + 63, calculate the residual for the observation (40, 82.1).

Solution 3:

First, we find the predicted value \hat y = 0.45 \times 40 + 63 = 81.

The given value of y is 82.1.

Using the formula

    \[e = y - \hat y \]

we get e = 82.1 - 81 = 1.1.

Therefore, the residual is 1.1.
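The arithmetic in Example 3 is easy to confirm in code; the helper name residual is our own:

```python
# Residual for the observation (40, 82.1) under the fit y_hat = 0.45x + 63.
def residual(x, y):
    y_hat = 0.45 * x + 63  # predicted value from the fitted line
    return y - y_hat       # observed minus predicted

print(round(residual(40, 82.1), 1))  # 82.1 - 81 = 1.1
```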

Example 4: For a linear regression \hat y  = 4x + 5, find the residuals for the observations (1, 2) and (6, 4).

Solution 4:

First, we find the predicted values: \hat y = 4 \times 1 + 5 = 9 for the first observation and \hat y = 4 \times 6 + 5 = 29 for the second.

The given value of y is 2 for the first observation and 4 for the second one.

Using the formula

    \[e = y - \hat y \]

the residuals are:

Observation 1: e = 2 - 9 = -7

Observation 2: e = 4 - 29 = -25

Example 5: Find the residuals of the yellow observation and the green observation from the graph given below:

Solution 5:

Using the formula

    \[e = y - \hat y \]

For the yellow observation, the value of y is 8 and the value of \hat y is 4.

Therefore, e is 8 - 4 = 4.

For the green observation, the value of y is 3 and the value of \hat y is 5.

Therefore, e is 3 - 5 = -2.

Frequently asked questions (FAQs)

What is the normal distribution?

An example of a continuous probability distribution is the normal distribution, in which the majority of data points cluster in the middle of the range while the remaining ones taper off symmetrically toward either extreme.

What is an independent variable?

The cause in a study is the independent variable. Its value is unaffected by other research factors.

What is a scatter plot?

The graphs that show the association between two variables in a data collection are called scatter plots. It displays data points either on a Cartesian system or a two-dimensional plane. The X-axis is used to represent the independent variable or characteristic, while the Y-axis is used to plot the dependent variable.

What is linear regression?

A variable’s value can be predicted using linear regression analysis based on the value of another variable. The dependent variable is the one you want to be able to forecast. The independent variable is the one you’re using to make a prediction about the value of the other variable.

What is a time series?

A time series is a group of observations of clearly defined data points produced over time via repeated measurements.


