Method of Linear Regression

When two variables are believed to be linearly related, the method of least squares may be used to find the "best" straight line through the points. The method that follows assumes that one "knows" the variable on the x-axis more accurately than the variable on the y-axis. The y-axis variable is often referred to as the dependent variable and the x-axis variable as the independent variable. Where a mathematical function y=f(x) is being considered, one might say that the value of x determines the value of y. Where y represents values measured with an instrument and a relationship between x and y is only presumed, one would expect that relationship to appear in the data, but also to be somewhat obscured by the biases and random errors typical of the instrument that measured the y values and of whatever instrument established the x values.

In colorimetry, for example, the x-axis variable is the concentration of a known solution and the y-axis variable is the measured absorbance of that solution. Once the relationship is established, the absorbance of an unknown solution is measured and the line representing the relationship between the two variables is used to determine the concentration of the unknown. A melting point curve would show concentration, mole fraction, or w/w % on the x-axis and melting point on the y-axis. There are many cases, though, in which the convention of placing the more accurately known variable on the x-axis is unclear or is not followed. A pressure/volume diagram is one in which both variables might be known with equal precision. When calibrating a buret, the volume is customarily shown along the x-axis and the "corrected volume" obtained from a mass measured on an analytical balance appears along the y-axis, even though the mass can be determined to 4-5 significant figures and the volume only to 3-4. In any case, the method described below treats the y-axis variable as the one carrying measurable error, and it works with the "residuals", the vertical differences between each measured y value and the best straight line through all the points. The method finds m (the slope) and b (the y-intercept) for a relationship given by

$y = mx + b$
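
For reference, the "best" line in the least squares sense is the one whose m and b make the sum of the squared vertical residuals as small as possible. The label $SS_{resid}$ is introduced here only for illustration and is not part of the original notation:

$SS_{resid} = \sum_i \left[ y_i - (m x_i + b) \right]^2$

Each term in the sum is the square of one residual, so points far from the line in the vertical direction count much more heavily than points close to it.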

Five intermediate quantities are defined for convenience in calculating the various values associated with a least squares linear regression in two variables. Seven useful results can be calculated from these five intermediate quantities, but for the purpose of this discussion only three will be shown: the slope m, the y-intercept b, and the standard deviation about the regression line. "N" in each equation below represents the number of $x_i$ and $y_i$ pairs, that is, the number of measurements.

The Five Intermediate Quantities

$S_{xx} = \sum_i x_i^2 - \dfrac{\left(\sum x_i\right)^2}{N}$

$S_{yy} = \sum_i y_i^2 - \dfrac{\left(\sum y_i\right)^2}{N}$

$S_{xy} = \sum_i x_i y_i - \dfrac{\left(\sum x_i\right)\left(\sum y_i\right)}{N}$

$\bar{x} = \dfrac{\sum x_i}{N}$

$\bar{y} = \dfrac{\sum y_i}{N}$

Three of Seven Useful Results

The slope m may be calculated using the formula

$m =\dfrac{S_{xy}}{S_{xx}}$

The y-intercept b may be calculated using the formula

$b = \bar{y} - m \bar{x}$

The standard deviation $s_r$ about the regression line may be calculated using the formula

$s_r = \sqrt{\dfrac{S_{yy} - m^2 S_{xx}}{N-2}}$
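
The three results can be sketched in a few lines of Python using only the standard library. This is a hedged example under stated assumptions: the function name `least_squares_line` and the sample data are hypothetical, and the two guard conditions (at least three points, x values not all equal) are added here because the formulas divide by N-2 and by $S_{xx}$.

```python
import math


def least_squares_line(xs, ys):
    """Return (m, b, s_r): slope, y-intercept, std. deviation about regression."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    if n < 3:
        raise ValueError("s_r divides by N - 2, so at least 3 points are needed")
    sum_x = sum(xs)
    sum_y = sum(ys)
    # Intermediate quantities, computed from their definitions.
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    m = s_xy / s_xx                      # slope
    b = sum_y / n - m * sum_x / n        # y-intercept: y_bar - m * x_bar
    # Round-off can push the numerator slightly below zero for a perfect fit,
    # so it is clamped at zero before taking the square root.
    s_r = math.sqrt(max(s_yy - m * m * s_xx, 0.0) / (n - 2))
    return m, b, s_r


# Made-up data, for illustration only.
m, b, s_r = least_squares_line([1.0, 2.0, 3.0, 4.0, 5.0],
                               [2.0, 4.0, 5.0, 4.0, 5.0])
print(m, b, s_r)
```

A small $s_r$ relative to the spread of the y values indicates that the points lie close to the fitted line.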

A number of calculators have built-in software to obtain these results. The process is often referred to as "linear regression" in calculator manuals. Spreadsheet programs also offer this feature.
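
Whichever tool produces m and b, the fitted line is then used as in the colorimetry example described earlier: the measured absorbance of an unknown is converted to a concentration by solving y = mx + b for x. The sketch below illustrates this in Python. The calibration values, the absorbance of the unknown, and the function names `fit_line` and `concentration_from_absorbance` are all hypothetical, invented for illustration rather than taken from any real measurement.

```python
def fit_line(xs, ys):
    """Return slope m and y-intercept b of the least squares line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    m = s_xy / s_xx
    b = sum_y / n - m * sum_x / n
    return m, b


def concentration_from_absorbance(absorbance, m, b):
    """Invert y = m*x + b to read a concentration off the calibration line."""
    return (absorbance - b) / m


# Hypothetical calibration standards (concentration in arbitrary units).
concentrations = [0.0, 2.0, 4.0, 6.0, 8.0]
absorbances = [0.01, 0.21, 0.40, 0.61, 0.80]

m, b = fit_line(concentrations, absorbances)
# Hypothetical absorbance reading for the unknown solution.
unknown = concentration_from_absorbance(0.50, m, b)
print(f"m = {m:.4f}, b = {b:.4f}, unknown concentration = {unknown:.2f}")
```

The result is only as reliable as the calibration itself, and reading concentrations outside the range spanned by the standards relies on the assumption that the linear relationship continues to hold there.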