# 3.4: Least Squares Linear Regression


Your experimental data will usually be examined in graphic form, which can be more instructive than tabulated data. In any graphic presentation, be sure to plot your independent variable on the abscissa (x-axis) and the dependent variable on the ordinate (y-axis).

If the data are presented in graphic form, the usual procedure is to fit the best curve through the experimental points. As Brownlee (ref. 3) puts it,

"When an investigator observes simultaneously two variables, x and y, usually with good reason he plots the observations on a graph. If there is any sign of an association, he is usually seized with the impulse to fit a line, usually a straight line or, rather infrequently, a parabola or cubic."

When we are "seized with the impulse" to fit a straight line, the curve-fitting process is called linear regression. We assume that the deviations of the dependent variable y are distributed normally (that is, they follow a Gaussian distribution). This assumption is not required for the determination of the least squares parameters, but it is necessary for the construction of confidence intervals or for tests of hypotheses about the parameters. The parameters obtained by least squares under these conditions are unbiased and provide the best estimates of the curve fitting parameters.

Most of the experimental data that you will encounter this semester will have a simple functional form, or can be manipulated in such a way as to assume a simple form. We shall assume here that our data follow a linear relationship. For example, consider a gas that is thought to follow the equation of state

$\mathrm{Z} \equiv \frac{\mathrm{PV}}{\mathrm{nRT}}=\mathrm{B}_{0}+\mathrm{B}_{1} \mathrm{P} \\ \mathrm{y}=\mathrm{a}+\mathrm{bx}$

Measurements of Z, called the compressibility factor, are made at each of a series of pressures in an effort to determine $${B}_{0}$$ and $${B}_{1}$$. The least squares approach can be used to determine the slope (b, or $${B}_{1}$$) and y-intercept (a, or $${B}_{0}$$) that define the best straight line that can be drawn through the data. In simple linear regression analysis, it is assumed that all the error in the measurement lies in the dependent variable (y). This is equivalent to assuming that the precision of the determination of the independent variable (x) is considerably higher than that of y. The measured value of some quantity ($$y_{i}$$) will differ from the value predicted by a linear equation ($$y=a+b x$$) by an amount $$r_{i}$$ called the residual.

${r}_{i} = {y}_{i} -a-b{x}_{i}$

The least squares approach identifies the best straight line as the one for which R, the sum of the squares of the residuals, has the smallest value:

$R=\sum_{i=1}^{N} r_{i}^{2}=\sum_{i=1}^{N}\left(y_{i}-a-b x_{i}\right)^{2}$

The arrows in Figure 4 represent the residuals, and all of them are equally weighted.

We want to minimize the sum of the squares of the residuals, hence the name "least squares." Since this is a minimization problem, we find the values of a and b that make R as small as possible by taking the derivatives of R with respect to these two parameters and setting them equal to zero.

${ \left( \frac{\partial R}{\partial a}\right) }_{b} =-2 \sum( {y}_{i}-a-b {x}_{i})=0 \\ { \left( \frac{\partial R}{\partial b}\right) }_{a} =-2 \sum(( {y}_{i}-a-b {x}_{i})( {x}_{i})) =0$

Solving these two equations simultaneously, we obtain

$a=\frac{(\sum x_{i}^{2})(\sum y_{i})-(\sum x_{i})(\sum x_{i}y_{i})}{N(\sum x_{i}^{2})-(\sum x_{i})^{2}}$

$b=\frac{N(\sum x_{i}y_{i})-(\sum x_{i})(\sum y_{i})}{N(\sum x_{i}^{2})-(\sum x_{i})^{2}}$
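The closed-form expressions for a and b above reduce to a handful of sums. The following is a minimal Python sketch of that calculation, not the course's own script; the function name `fit_line` and the sample data are illustrative assumptions.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x.

    Uses the closed-form sums for the intercept a and slope b.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sx = sum(x)                                   # sum of x_i
    sy = sum(y)                                   # sum of y_i
    sxx = sum(xi * xi for xi in x)                # sum of x_i^2
    sxy = sum(xi * yi for xi, yi in zip(x, y))    # sum of x_i * y_i

    denom = n * sxx - sx ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    a = (sxx * sy - sx * sxy) / denom
    b = (n * sxy - sx * sy) / denom
    return a, b


# Made-up data lying exactly on y = 1 + 2x (for illustration only)
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)
print(a, b)
```

For data that lie exactly on a line, as here, the fit should recover the intercept and slope of that line.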

While we ultimately want to fit the data to $$y_{i}=a+b x_{i}$$ for a series i of N points, the theoretical development is much easier if we use $$y_{i}=c+b (x_{i}- \overline{x})$$, where a and c are related by $$a=c-b \overline{x}$$, and where $$\overline{x}$$ is simply the mean of the x data, $$N \overline{x}= \sum_{i=1}^{N} {x}_{i}$$. This choice of variables is better because "it turns out" that in this approach the parameters b and c are independent, their variances are independent, and the formulas to transform to the parameters a and b are simple. Again, we want to minimize R, the sum of the squares of the residuals $$r_{i}=y_{i}-c- b( {x}_{i} -\overline{x})$$.

$R=\sum_{i=1}^{N} r_{i}^{2}=\sum_{i=1}^{N}[ {y}_{i} - \left( c+b ( {x}_{i} -\overline{x})\right)]^2$

Now we require that R be as small as possible with respect to b and c by taking derivatives with respect to these two parameters and setting the derivatives equal to zero.

${ \left( \frac{\partial R}{\partial c}\right)}_{b}=-2 \sum \left({y}_{i} -c-b \left( {x}_{i} -\overline{x}\right)\right)=0 \\ { \left( \frac{\partial R}{\partial b}\right)}_{c}=-2 \sum \left( \left({y}_{i} -c-b\left( {x}_{i} -\overline{x}\right) \right)\left( {x}_{i} -\overline{x}\right)\right)=0$

These two equations are readily rewritten as

$\sum {y}_{i} = \sum c + \sum b \left( {x}_{i} -\overline{x} \right) \\ \sum {y}_{i} \left( {x}_{i} -\overline{x} \right) = \sum c \left( {x}_{i} -\overline{x} \right) + \sum b \left( {x}_{i} -\overline{x} \right)^2$

which, since $$\sum \left( {x}_{i} -\overline{x} \right) = 0$$, simplify to

$c = \frac{\sum {y}_{i} }{N} = \overline{y} \\ b = \frac{\sum {y}_{i} \left( {x}_{i} -\overline{x}\right)}{\sum \left( {x}_{i} -\overline{x}\right)^2}$

Knowing c and b then allows us to determine the parameter a from the particularly simple relation $$a = c - b \overline{x} = \overline{y} - b \overline{x}$$.
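The mean-centered route to the same parameters can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `fit_line_centered` and the data values are made up for the example.

```python
def fit_line_centered(x, y):
    """Return (a, b, c) using the mean-centered form y = c + b*(x - x_bar)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sum of squared deviations of x from its mean
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    c = y_bar                                                  # c is the mean of y
    b = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx   # slope
    a = c - b * x_bar                                          # intercept of y = a + b*x
    return a, b, c


# Made-up, slightly noisy data (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, c = fit_line_centered(x, y)
print(a, b, c)
```

For this data set the slope and intercept agree with those obtained from the uncentered closed-form sums, as they must, since both forms describe the same line.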

The variance and standard deviation of N data points were introduced previously. The appropriate treatment of the parameters c and b in the least squares procedure results in

$V \left[ c\right] = \frac{s^2}{N} \\ V \left[ b\right] = \frac{s^2}{\sum \left( {x}_{i} -\overline{x} \right)^2}$

where

$s^2 = \frac{1}{N-2} \sum \left({y}_{i} -c-b \left( {x}_{i} -\overline{x}\right) \right)^2 = \frac{1}{N-2} \sum \left({y}_{i} -a-b {x}_{i} \right)^2 = \frac{{R}_{min} }{N-2}$

Because c and b are independent quantities, it can be shown that

$V \left[ a \right] = V \left[ c \right] + \overline{x}^2 V \left[ b \right] \\ = \frac{s^2}{N} + \overline{x}^2 \frac{s^2}{\sum \left( {x}_{i} -\overline{x}\right)^2 }$

The standard deviations of a and b, $${ \sigma}_{a}$$ and $${ \sigma}_{b}$$, are then obtained by taking the square roots of V[a] and V[b].
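As a rough illustration of how these variance formulas fit together, the sketch below computes $$s^2$$, V[c], V[b], and V[a] and returns the standard deviations. The function name `fit_with_uncertainties`, the dictionary keys, and the data are assumptions made for this example and are not taken from the course scripts.

```python
import math


def fit_with_uncertainties(x, y):
    """Fit y = a + b*x and estimate the standard deviations of a and b."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three (x, y) pairs so that N - 2 > 0")

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    b = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    c = y_bar
    a = c - b * x_bar

    # Minimum sum of squared residuals and the variance estimate s^2
    r_min = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = r_min / (n - 2)

    var_c = s2 / n
    var_b = s2 / sxx
    var_a = var_c + x_bar ** 2 * var_b   # valid because c and b are independent

    return {
        "a": a,
        "b": b,
        "s2": s2,
        "sigma_a": math.sqrt(var_a),
        "sigma_b": math.sqrt(var_b),
    }


# Made-up, slightly noisy data (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = fit_with_uncertainties(x, y)
print(result)
```

Note that at least three points are needed, since two points define a line exactly and leave no degrees of freedom for estimating $$s^2$$.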

Another statistic that is sometimes used is the correlation coefficient, r. It is given by the expression

$r=\frac{N \sum x_{i} y_{i}-\sum x_{i} \sum y_{i}}{\sqrt{\left(N \sum x_{i}^{2}-\left(\sum x_{i}\right)^{2}\right)\left(N \sum y_{i}^{2}-\left(\sum y_{i}\right)^{2}\right)}}$

When the absolute value of r is close to unity, the correlation between the y and x data is good; when it is close to zero, the correlation is poor. Because the correlation coefficient is sometimes hard to interpret, it is usually better to work with the standard deviations of the regression parameters, $${ \sigma}_{a}$$ and $${ \sigma}_{b}$$.
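The expression for r can be evaluated from the same kinds of sums used for the slope and intercept. Below is a minimal Python sketch; the function name `correlation_coefficient` and the data sets are hypothetical and chosen only for illustration.

```python
import math


def correlation_coefficient(x, y):
    """Return the correlation coefficient r of paired x, y data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sx = sum(x)
    sy = sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))

    numerator = n * sxy - sx * sy
    product = (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    if product <= 0:
        raise ValueError("x or y has no spread; r is undefined")

    return numerator / math.sqrt(product)


# Made-up data: an exact rising line, an exact falling line, and noisy data
r_up = correlation_coefficient([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
r_down = correlation_coefficient([0.0, 1.0, 2.0, 3.0], [7.0, 5.0, 3.0, 1.0])
r_noisy = correlation_coefficient([1.0, 2.0, 3.0, 4.0, 5.0],
                                  [2.1, 3.9, 6.2, 7.8, 10.1])
print(r_up, r_down, r_noisy)
```

The sign of r follows the sign of the slope, which is why the text refers to the absolute value of r when judging the quality of the correlation.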

Linear regression is a powerful tool that takes the guesswork out of obtaining best fit information. However, like all mathematical tools, it must be used with caution. While the standard deviations and the correlation coefficient give indications of the goodness of the fit, there is no substitute for graphing the data and looking at the result. Since the errors are assumed to be random, the $$y_{i}$$ values should be scattered about the best fit straight line in a random fashion. The residuals should likewise be scattered between positive and negative values without any pattern (and should sum to zero). If the data show a pattern of curvature, the results may not conform to the proposed linear model, and a linear regression is not valid in that case. You should also be alert to the effects of systematic errors, which may change the slope or the intercept without affecting the linearity of the data.
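To make the point about patterns in the residuals concrete, the sketch below fits a straight line to deliberately curved (quadratic) data and lists the signs of the residuals. The function name `residuals_of_line_fit` and the data are assumptions chosen for illustration.

```python
def residuals_of_line_fit(x, y):
    """Fit y = a + b*x by least squares and return the residuals y_i - a - b*x_i."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    b = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    return [yi - a - b * xi for xi, yi in zip(x, y)]


# Made-up data that follow y = x^2, so a straight line is the wrong model
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]

res = residuals_of_line_fit(x, y)
signs = "".join("+" if r > 0 else "-" for r in res)

print(res)
print(signs)     # +---+
print(sum(res))  # the residuals still sum to zero
```

The residuals sum to zero even though the model is wrong, so the sum alone is not a useful diagnostic. The run of like signs in the middle, flanked by opposite signs at the ends, is the kind of systematic pattern that signals curvature.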

Linear regression of the type discussed above relies on a number of critical assumptions. Among the most important is that the uncertainty in each of the y values is the same. If the form of the equation required to provide a linear relationship produces an independent variable that contains a significant uncertainty, the parameters that minimize the sum of the squared residuals may not be the ones that actually minimize the uncertainty in the data. In this case, a "non-linear" least squares technique such as simplex is recommended.

##### Note

In this course, you can use the MATLAB lsq.m script to perform least squares analysis. The equations given above are included in the lsq script.

## Weighted Least Squares Analysis

When measurements of the dependent variable incur different uncertainties, a weighted least squares technique should be used. In this approach, each residual $$r_{i}$$ is divided by a factor proportional to its uncertainty $$\varepsilon_{y_{i}}$$. Then R, the sum of the squares of the residuals, becomes

$R = \sum_{i=1}^{N}\frac{r_{i}^{2}}{\varepsilon_{y_{i}}^{2}} = \sum_{i=1}^{N}\frac{1}{\varepsilon_{y_{i}}^{2}}\left ( y_{i}-a-bx_{i} \right )^2$

Using the same minimization techniques described above, the slope and intercept for weighted least squares become

$a=\frac{\sum \frac{x_{i}^{2}}{\varepsilon_{y_{i}}^{2}} \sum \frac{y_{i}}{\varepsilon_{y_{i}}^{2}}-\sum \frac{x_{i}}{\varepsilon_{y_{i}}^{2}} \sum \frac{x_{i} y_{i}}{\varepsilon_{y_{i}}^{2}}}{\sum \frac{1}{\varepsilon_{y_{i}}^{2}} \sum \frac{x_{i}^{2}}{\varepsilon_{y_{i}}^{2}}-\left(\sum \frac{x_{i}}{\varepsilon_{y_{i}}^{2}}\right)^{2}}$

$b=\frac{\sum \frac{1}{\varepsilon_{y_{i}}^{2}} \sum \frac{x_{i}{y_{i}}}{\varepsilon_{y_{i}}^{2}}-\sum \frac{x_{i}}{\varepsilon_{y_{i}}^{2}} \sum \frac{y_{i} }{\varepsilon_{y_{i}}^{2}}}{\sum \frac{1}{\varepsilon_{y_{i}}^{2}} \sum \frac{x_{i}^{2}}{\varepsilon_{y_{i}}^{2}}-\left(\sum \frac{x_{i}}{\varepsilon_{y_{i}}^{2}}\right)^{2}}$
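The weighted expressions have the same structure as the unweighted ones, with each sum carrying a weight of $$1/\varepsilon_{y_{i}}^{2}$$. The Python sketch below illustrates this; the function name `fit_line_weighted`, its argument names, and the data are hypothetical, and it is not meant to reproduce the course's weighted least squares script.

```python
def fit_line_weighted(x, y, eps):
    """Return (a, b) for y = a + b*x, weighting each point by 1/eps_i^2."""
    n = len(x)
    if not (n == len(y) == len(eps)) or n < 2:
        raise ValueError("x, y, and eps must have the same length (at least 2)")
    if any(e <= 0 for e in eps):
        raise ValueError("every uncertainty must be positive")

    w = [1.0 / e ** 2 for e in eps]                            # weights 1/eps_i^2

    sw = sum(w)                                                # sum of 1/eps^2
    swx = sum(wi * xi for wi, xi in zip(w, x))                 # sum of x/eps^2
    swy = sum(wi * yi for wi, yi in zip(w, y))                 # sum of y/eps^2
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))           # sum of x^2/eps^2
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))    # sum of x*y/eps^2

    denom = sw * swxx - swx ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    a = (swxx * swy - swx * swxy) / denom
    b = (sw * swxy - swx * swy) / denom
    return a, b


# Made-up data (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Equal uncertainties: the result should match the unweighted fit
a_equal, b_equal = fit_line_weighted(x, y, [1.0] * 5)

# Unequal (hypothetical) uncertainties: later points count for less
a_wt, b_wt = fit_line_weighted(x, y, [0.1, 0.1, 0.2, 0.4, 0.8])
print(a_equal, b_equal)
print(a_wt, b_wt)
```

When all the uncertainties are equal, the weights cancel between numerator and denominator and the weighted formulas reduce to the unweighted ones.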

Weighted least squares has utility in a number of applications in the physical chemistry laboratory. One common situation arises when a linear equation requires the log of a measured quantity to be the dependent variable. Consider, for example, the measurement of vapor pressure. Most pressure sensing devices will provide a constant error in the measurement of P. If the dependent variable to be plotted is ln P, however, its uncertainty is not constant but decreases as P increases. Thus,

$\varepsilon_{\ln P} = \frac{\partial \ln P}{\partial P}\varepsilon_{P} = \frac{\varepsilon_{P}}{P}$

Because $$\varepsilon_{P}$$ is constant, the uncertainty in ln P is proportional to 1/P, so an appropriate weighting factor in this case would be $$\varepsilon_{y_{i}}=\frac{1}{P_{i}}$$.
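The propagation of a constant pressure error into ln P can be checked numerically. In the sketch below, the function name `ln_pressure_uncertainties`, the pressures, and the sensor error are all made-up values used only to illustrate the relation.

```python
import math


def ln_pressure_uncertainties(pressures, eps_p):
    """Return eps_lnP = eps_P / P for each pressure, given a constant eps_P."""
    if eps_p <= 0 or any(p <= 0 for p in pressures):
        raise ValueError("pressures and eps_p must be positive")
    return [eps_p / p for p in pressures]


# Hypothetical pressures and a hypothetical constant sensor error (same units)
pressures = [10.0, 100.0, 1000.0]
eps_p = 0.5

eps_ln = ln_pressure_uncertainties(pressures, eps_p)

for p, e in zip(pressures, eps_ln):
    # Actual change in ln P when P is shifted by eps_p, for comparison
    shift = math.log(p + eps_p) - math.log(p)
    print(p, e, shift)
```

The propagated uncertainty shrinks as P grows, and it tracks the actual shift in ln P closely as long as the error is small compared with P. The overall scale of the weighting factor does not affect the fitted slope and intercept, which is why a factor proportional to 1/P is sufficient.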

When you use MATLAB for data analysis this semester, the command lsq(x,y) will be used for least squares analysis of x, y data, and the command wtlsq(x,y) will be used for weighted least squares analysis. In the latter case, you will be asked for a weighting factor, e. In either case, the program will output the least squares slope, y-intercept, correlation coefficient, standard deviation of the slope, and standard deviation of the intercept.

This page titled 3.4: Least Squares Linear Regression is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Kathryn Haas.