| Module 3: Interpreting Data
5.3 Scatter plots
5.3.5 The line of best fit (regression line)
The scatterplots above have a trend line fitted, which makes it easier to visualise and summarise the direction of the data.
The line of best fit is drawn by:
- having the same number of data points on each side of the line - i.e., the line is in the median position;
- NOT going from the first data point to the last data point - since extreme data points often deviate from the general trend and this would give a biased sense of direction.
Note: In univariate data sets we always calculated two summary numbers - one for centre of the data set, and the other for the spread of the data set. We have a similar situation in bivariate data sets.
1. Measure of centre. The regression line is in the median position - it is the analogue in two dimensions of the median in univariate data. You will remember how the median is the middle point of a ranked data set.
2. Measure of spread. The closeness (or otherwise) of the cloud of data points to the line suggests the univariate concept of spread or dispersion.
The graph below shows what happens when we draw the line of best fit from the first data point to the last data point - it does not go through the median position, as there is one data point above and three data points below the blue line. This is a common mistake to avoid!
To determine the equation for the line of best fit:
1. draw the scatterplot on a grid and draw the line of best fit;
2. select two points on the line which are, as near as possible, on grid intersections so that you can accurately estimate their position;
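3. calculate the gradient (B) of the line using the formula $B = \frac{Y_2 - Y_1}{X_2 - X_1}$, where $(X_1, Y_1)$ and $(X_2, Y_2)$ are the two chosen points;
4. write the partial equation $\hat{Y} = A + BX$, with the calculated value of B substituted;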
5. substitute one of the chosen points into the partial equation to evaluate the "A" term;
6. write the full equation of the line.
Example:
Consider the data in the graph below:
To determine the equation for the line of best fit:
- a computer application has calculated and plotted the line of best fit for the data - it is shown as a black line - and it is in the median position, with 3 data points on one side and 3 data points on the other side;
- the two points chosen on the line are (50, 700) and (110, 1100);
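- calculate the gradient (B) of the line using the formula: $B = \frac{1100 - 700}{110 - 50} = \frac{400}{60} \approx 6.67$;
- write the partial equation: $\hat{Y} = A + 6.67X$;
- substitute the point (50, 700) into the equation: $700 = A + \frac{400}{60} \times 50$, so $A = 700 - 333.33 \approx 366.67$;
- write the full equation of the line: $\hat{Y} = 366.67 + 6.67X$.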
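The same calculation can be sketched in Python as a quick check of the arithmetic. This is only an illustration: the function name `line_through` is made up for this example, and it assumes the line is written in the form Y-hat = A + B X, with the two points taken from the example above.

```python
def line_through(p1, p2):
    """Return (A, B) for the line Y_hat = A + B*X through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points must have different X values")
    b = (y2 - y1) / (x2 - x1)  # gradient: change in Y over change in X
    a = y1 - b * x1            # substitute (x1, y1) and solve for A
    return a, b


# The two points chosen on the line in the example.
a, b = line_through((50, 700), (110, 1100))
print(f"Y_hat = {a:.2f} + {b:.2f} X")  # prints: Y_hat = 366.67 + 6.67 X
```

Substituting the second point, (110, 1100), into the finished equation is a useful check: it should give back a Y value of 1100.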
Interesting fact: The point $(\bar{X}, \bar{Y})$, whose coordinates are the means of the X values and the Y values, ALWAYS lies on the line of best fit!
The following graph shows a line which accurately describes the direction of the data, but it is not the line of best fit. It is too far away from the median position - it has only two data points underneath but has 4 data points above.
So it does not split the data cloud into two equal parts.
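Counting the points on each side of a candidate line can also be done in a few lines of Python. The sketch below is illustrative only: the function name `count_sides` and the six data points are hypothetical (they are not the data in the graph), and the line is assumed to be written as Y-hat = A + B X.

```python
def count_sides(points, a, b):
    """Count data points above, below and exactly on the line Y_hat = a + b*X."""
    above = below = on = 0
    for x, y in points:
        y_hat = a + b * x  # height of the line at this X value
        if y > y_hat:
            above += 1
        elif y < y_hat:
            below += 1
        else:
            on += 1
    return above, below, on


# Hypothetical data set, invented for this illustration.
points = [(1, 4), (2, 3), (3, 9), (4, 8.5), (5, 12), (6, 12)]

print(count_sides(points, 1.0, 2.0))   # (3, 3, 0): splits the cloud evenly
print(count_sides(points, 0.25, 2.0))  # (4, 2, 0): same direction, wrong position
```

Both lines have the same gradient, so both describe the direction of the data, but only the first is in the median position.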
In fact, the mathematical process which determines the unique line of best fit is based on what is called the method of least squares - which explains why this line is sometimes called the least squares line.
This method works by:
1. finding the difference of each data Y value from the line;
2. squaring all the differences;
3. summing all the squared differences;
4. repeating this process for all positions of the line until the smallest sum of squared differences is reached;
5. when this is done, we have found the correct position (correct A and B) for the unique line of best fit.
The vertical distance between a data point and the line (written with the symbol e and called the "residual", so that $e = Y - \hat{Y}$) is what is squared and then added to the squared residuals of all the other data points, until each data point has been used.
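A minimal Python sketch of the least squares idea follows. Two things here are assumptions rather than part of the method as described: the five data points are hypothetical, and instead of trying every position of the line, the sketch uses the standard closed-form formulas for A and B, which give the same minimum directly. The function names are invented for this example.

```python
def sum_squared_residuals(points, a, b):
    """Square each residual e = Y - (a + b*X) and sum the squares."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


def least_squares_line(points):
    """Return (A, B) that minimise the sum of squared residuals.

    Uses the closed-form solution rather than a search over line positions.
    """
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx        # gradient
    a = y_bar - b * x_bar  # forces the line through the point of means
    return a, b


# Hypothetical data set, invented for this illustration.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]

a, b = least_squares_line(points)
best = sum_squared_residuals(points, a, b)
print(f"Y_hat = {a:.2f} + {b:.2f} X")

# Moving the line away from this position increases the sum of squares.
print(sum_squared_residuals(points, a + 0.5, b) > best)  # True
print(sum_squared_residuals(points, a, b + 0.2) > best)  # True
```

Shifting A or B away from the least squares values always makes the sum of squared residuals larger, which is what "smallest sum of squared differences" means in the steps above.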
You may have wondered why the line equation is written with a "hat" on top of the Y variable. This is because any Y value which is calculated by substitution into the equation is not a "real data value". In the graph, the point $(X, \hat{Y})$ lies directly above the data point $(X, Y)$. $\hat{Y}$ is on the line, but $Y$ usually is not - because most data points don't lie on the line of best fit - they surround it.
TEST EXAMPLE
Estimate the equation of the line of best fit for the following data sets:
|