14.2: Verifying the Method

After developing and optimizing a method, the next step is to determine how well it works in the hands of a single analyst. Three steps make up this process: determining single-operator characteristics, completing a blind analysis of standards, and determining the method’s ruggedness. If another standard method is available, then we can analyze the same sample using both the standard method and the new method, and compare the results. If the result for any single test is unacceptable, then the method is not a suitable standard method.

Single Operator Characteristics

The first step in verifying a method is to determine the precision, accuracy, and detection limit when a single analyst uses the method to analyze a standard sample. The detection limit is determined by analyzing an appropriate reagent blank. Precision is determined by analyzing replicate portions of the sample, preferably more than ten. Accuracy is evaluated using a t-test to compare the experimental results to the known amount of analyte in the standard. Precision and accuracy are evaluated for several different concentrations of analyte, including at least one concentration near the detection limit, and for each different sample matrix. Including different concentrations of analyte helps to identify constant sources of determinate error and to establish the range of concentrations for which the method is applicable.
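The accuracy check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `t_experimental`, the replicate values, and the known value of 100.0 are all hypothetical. The computed statistic is compared to a tabulated critical value of t, for example 2.228 for 10 degrees of freedom at the 95% confidence level.

```python
import math
import statistics


def t_experimental(results, known_value):
    """Return t_exp = |mean - mu| * sqrt(n) / s for replicate results."""
    n = len(results)
    mean = statistics.mean(results)
    s = statistics.stdev(results)  # sample standard deviation (n - 1 in the denominator)
    return abs(mean - known_value) * math.sqrt(n) / s


# Hypothetical replicate results (not from the text) for a standard whose
# known analyte content is 100.0 in the same units.
replicates = [99.2, 100.4, 99.8, 100.9, 99.5, 100.1, 99.7, 100.6, 99.9, 100.3, 99.6]
t_exp = t_experimental(replicates, known_value=100.0)
print(f"t_exp = {t_exp:.2f} with {len(replicates) - 1} degrees of freedom")
```

If t_exp is smaller than the critical value, there is no evidence of a determinate error at the chosen confidence level.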

Blind Analysis of Standard Samples

Single-operator characteristics are determined by analyzing a standard sample that has a concentration of analyte known to the analyst. The second step in verifying a method is a blind analysis of standard samples. Although the concentration of analyte in the standard is known to a supervisor, the information is withheld from the analyst. After the analyst analyzes the standard sample several times, the analyte’s average concentration is reported to the test’s supervisor. To be accepted, the experimental mean must be within three standard deviations, as determined from the single-operator characteristics, of the analyte’s known concentration.

A more stringent requirement is that the experimental mean be within two standard deviations of the analyte’s known concentration.
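The acceptance criterion for a blind analysis amounts to a single comparison. The sketch below assumes a hypothetical helper, `within_k_sigma`, and hypothetical values for the mean, the known concentration, and the single-operator standard deviation.

```python
def within_k_sigma(experimental_mean, known_value, s_single_operator, k=3.0):
    """True if the mean lies within k standard deviations of the known value."""
    return abs(experimental_mean - known_value) <= k * s_single_operator


# Hypothetical values: blind-analysis mean, known concentration, and the
# standard deviation from the single-operator characteristics.
print(within_k_sigma(98.9, 100.0, 0.5))       # three-standard-deviation criterion
print(within_k_sigma(98.9, 100.0, 0.5, k=2))  # stricter two-standard-deviation criterion
```

With these illustrative numbers the result passes the three-standard-deviation test but fails the stricter one.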

Ruggedness Testing

An optimized method may produce excellent results in the laboratory that develops a method, but poor results in other laboratories. This is not particularly surprising because a method typically is optimized by a single analyst using the same reagents, equipment, and instrumentation for each trial. Any variability introduced by different analysts, reagents, equipment, and instrumentation is not included in the single-operator characteristics. Other less obvious factors may affect an analysis, including environmental factors, such as the temperature or relative humidity in the laboratory; if the procedure does not require control of these conditions, then they may contribute to variability. Finally, the analyst who optimizes the method usually takes particular care to perform the analysis in exactly the same way during every trial, which may minimize the run-to-run variability.

An important step in developing a standard method is to determine which factors have a pronounced effect on the quality of the results. Once we identify these factors, we can write specific instructions that specify how these factors must be controlled. A procedure that, when carefully followed, produces results of high quality in different laboratories is considered rugged. The method by which the critical factors are discovered is called ruggedness testing [Youden, W. J. Anal. Chem. 1960, 32(13), 23A–37A].

For example, if temperature is a concern, we might specify that it be held at \(25 \pm 2 \ ^\circ\text{C}\).

Ruggedness testing usually is performed by the laboratory that develops the standard method. After identifying potential factors, their effects on the response are evaluated by performing the analysis at two levels for each factor. Normally one level is that specified in the procedure, and the other is a level likely encountered when the procedure is used by other laboratories.

This approach to ruggedness testing can be time consuming. If there are seven potential factors, for example, a 2⁷ factorial design can evaluate each factor’s first-order effect. Unfortunately, this requires a total of 128 trials, which is too many to be practical. A simpler experimental design is shown in Table \(\PageIndex{1}\), in which the two factor levels are identified by upper case and lower case letters. This design, which is similar to a 2³ factorial design, is called a fractional factorial design. Because it includes only eight runs, the design provides information only about the average response and the seven first-order factor effects. It does not provide sufficient information to evaluate higher-order effects or interactions between factors, both of which are probably less important than the first-order effects.

Table \(\PageIndex{1}\). Experimental Design for a Ruggedness Test Involving Seven Factors
run	A	B	C	D	E	F	G	response
1	A	B	C	D	E	F	G	R₁
2	A	B	c	D	e	f	g	R₂
3	A	b	C	d	E	f	g	R₃
4	A	b	c	d	e	F	G	R₄
5	a	B	C	d	e	F	g	R₅
6	a	B	c	d	E	f	G	R₆
7	a	b	C	D	e	f	G	R₇
8	a	b	c	D	E	F	g	R₈

The experimental design in Table \(\PageIndex{1}\) is balanced in that each of a factor’s two levels is paired an equal number of times with the upper case and lower case levels for every other factor. To determine the effect, E, of changing a factor’s level, we subtract the average response when the factor is at its lower case level from the average response when it is at its upper case level.

\[E = \frac {\left( \sum R_i \right)_\text{upper case}} {4} - \frac {\left( \sum R_i \right)_\text{lower case}} {4} \label{14.1}\]

Because the design is balanced, the levels for the remaining factors appear an equal number of times in both summation terms, canceling their effect on E. For example, to determine the effect of factor A, \(E_A\), we subtract the average response for runs 5–8 from the average response for runs 1–4. Factor B does not affect \(E_A\) because its upper case levels in runs 1 and 2 are canceled by the upper case levels in runs 5 and 6, and its lower case levels in runs 3 and 4 are canceled by the lower case levels in runs 7 and 8. After we calculate each of the factor effects we rank them from largest to smallest without regard to sign, identifying those factors whose effects are substantially larger than those of the other factors.

To see that this design is balanced, look closely at the last four runs. Factor A is present at its level a for all four of these runs. For each of the remaining factors, two levels are upper case and two levels are lower case. Runs 5–8 provide information about the effect of a on the response, but do not provide information about the effect of any other factor. Runs 1, 2, 5, and 6 provide information about the effect of B, but not of the remaining factors. Try a few other examples to convince yourself that this relationship is general.
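The balance of the design also can be checked programmatically. The sketch below encodes the eight runs of Table \(\PageIndex{1}\) as strings, codes upper case levels as +1 and lower case levels as –1 (a coding chosen here for convenience), and confirms that every pair of factors shows each of the four level combinations equally often. The names `DESIGN`, `signs`, and `is_balanced` are illustrative.

```python
from itertools import combinations

# Runs 1-8 of the seven-factor design; each character is one factor's level.
DESIGN = [
    "ABCDEFG",
    "ABcDefg",
    "AbCdEfg",
    "AbcdeFG",
    "aBCdeFg",
    "aBcdEfG",
    "abCDefG",
    "abcDEFg",
]

# Code upper case levels as +1 and lower case levels as -1.
signs = [[1 if ch.isupper() else -1 for ch in run] for run in DESIGN]


def is_balanced(matrix):
    """Check that each factor and each pair of factors is balanced."""
    n_factors = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_factors)]
    # Each factor must appear at its two levels equally often.
    if any(sum(col) != 0 for col in columns):
        return False
    # With zero column sums, a zero dot product between two columns means
    # all four level combinations occur the same number of times.
    return all(
        sum(x * y for x, y in zip(c1, c2)) == 0
        for c1, c2 in combinations(columns, 2)
    )


print(is_balanced(signs))  # True for the design in the table
```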

We also can use this experimental design to estimate the method’s expected standard deviation due to the effects of small changes in uncontrolled or poorly controlled factors [Youden, W. J. “Statistical Techniques for Collaborative Tests,” in Statistical Manual of the Association of Official Analytical Chemists, Association of Official Analytical Chemists: Washington, D. C., 1975, p. 35].

\[s=\sqrt{\frac{2}{7} \sum_{i=1}^{7} E_{i}^{2}} \label{14.2}\]

If this standard deviation is too large, then the procedure is modified to bring under control the factors that have the greatest effect on the response.

Why does this model estimate the seven first-order factor effects, E, and not seven of the 21 possible first-order interactions between pairs of factors? With eight experiments, we can only choose to calculate seven parameters (plus the average response). The calculation of \(E_D\), for example, also gives the value for \(E_{AB}\). You can convince yourself of this by replacing each upper case letter with a \(+1\) and each lower case letter with a \(-1\) and noting that \(A \times B = D\). We choose to report the first-order factor effects because they likely are more important than interactions between factors.
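The relationship \(A \times B = D\) can be confirmed with the same +1/–1 coding. In the sketch below, the helper `aliased_main_effect` (a hypothetical name) multiplies two factor columns run by run and reports which single-factor column, if any, has the identical pattern of signs.

```python
from itertools import combinations

# Runs 1-8 of the seven-factor design in the table.
DESIGN = [
    "ABCDEFG", "ABcDefg", "AbCdEfg", "AbcdeFG",
    "aBCdeFg", "aBcdEfG", "abCDefG", "abcDEFg",
]
FACTORS = "ABCDEFG"

# One column of +1/-1 values per factor (upper case = +1, lower case = -1).
columns = {
    name: [1 if run[j].isupper() else -1 for run in DESIGN]
    for j, name in enumerate(FACTORS)
}


def aliased_main_effect(f1, f2):
    """Return the factor whose column equals the product of columns f1 and f2."""
    product = [x * y for x, y in zip(columns[f1], columns[f2])]
    for name, col in columns.items():
        if col == product:
            return name
    return None


print(aliased_main_effect("A", "B"))  # D

# List the main effect that each two-factor interaction is confounded with.
for f1, f2 in combinations(FACTORS, 2):
    print(f"{f1} x {f2} -> {aliased_main_effect(f1, f2)}")
```

Because the product of any two columns reproduces another factor's column, the eight runs cannot separate a first-order effect from the interactions that share its pattern of signs.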

Example \(\PageIndex{1}\)

The concentrations of trace metals in sediment samples collected from rivers and lakes are determined by extracting with acid and analyzing the extract by atomic absorption spectrophotometry. One procedure calls for an overnight extraction using dilute HCl or HNO₃. The samples are placed in plastic bottles with 25 mL of acid and then placed on a shaker operated at a moderate speed and at ambient temperature. To determine the method’s ruggedness, the effect of the following factors was studied using the experimental design in Table \(\PageIndex{1}\).

Factor A: extraction time	A = 24 h	a = 12 h
Factor B: shaking speed	B = medium	b = high
Factor C: acid type	C = HCl	c = HNO₃
Factor D: acid concentration	D = 0.1 M	d = 0.05 M
Factor E: volume of acid	E = 25 mL	e = 35 mL
Factor F: type of container	F = plastic	f = glass
Factor G: temperature	G = ambient	g = 25 °C

Eight replicates of a standard sample that contains a known amount of analyte are carried through the procedure. The percentages of analyte recovered in the eight samples are as follows: R₁ = 98.9, R₂ = 99.0, R₃ = 97.5, R₄ = 97.7, R₅ = 97.4, R₆ = 97.3, R₇ = 98.6, and R₈ = 98.6. Identify the factors that have a significant effect on the response and estimate the method’s expected standard deviation.

Solution

To calculate the effect of changing each factor’s level we use equation \ref{14.1} and substitute in appropriate values. For example, \(E_A\) is

\[E_{A}=\frac{98.9+99.0+97.5+97.7}{4} - \frac{97.4+97.3+98.6+98.6}{4}=0.30 \nonumber\]

Completing the remaining calculations and ordering the factors by the absolute values of their effects

Factor D = 1.30, Factor A = 0.30, Factor E = –0.10, Factor B = 0.05, Factor C = –0.05, Factor F = 0.05, Factor G = 0.00

shows us that the concentration of acid (Factor D) has a substantial effect on the response, with a concentration of 0.05 M providing a much lower percent recovery. The extraction time (Factor A) also appears significant, but its effect is not as important as the acid’s concentration. All other factors appear insignificant. The method’s estimated standard deviation is

\[s = \sqrt{\frac {2} {7} \times \left[ (1.30)^2 + (0.30)^2 + (-0.10)^2 + (0.05)^2 + (-0.05)^2 + (0.05)^2 + (0.00)^2 \right]} = 0.72 \nonumber\]

which, for an average recovery of 98.1%, gives a relative standard deviation of approximately 0.7%. If we control the acid’s concentration so that its effect approaches that for factors B, C, and F, then the standard deviation becomes 0.18, which is a relative standard deviation of approximately 0.2%.
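The calculations in this example are easy to reproduce in code. The following sketch applies equation \ref{14.1} to each factor and then equation \ref{14.2} to the resulting effects, using the design in Table \(\PageIndex{1}\) and the eight recoveries given above. The function names `factor_effects` and `estimated_std_dev` are illustrative choices, not part of any standard library.

```python
import math

# Runs 1-8 of the design (upper case and lower case mark the two levels).
DESIGN = [
    "ABCDEFG", "ABcDefg", "AbCdEfg", "AbcdeFG",
    "aBCdeFg", "aBcdEfG", "abCDefG", "abcDEFg",
]
FACTORS = "ABCDEFG"
# Percent recoveries R1 through R8 from the example.
RESPONSES = [98.9, 99.0, 97.5, 97.7, 97.4, 97.3, 98.6, 98.6]


def factor_effects(design, responses):
    """Effect of each factor: mean response at upper case minus mean at lower case."""
    effects = {}
    for j, name in enumerate(FACTORS):
        upper = [r for run, r in zip(design, responses) if run[j].isupper()]
        lower = [r for run, r in zip(design, responses) if run[j].islower()]
        effects[name] = sum(upper) / len(upper) - sum(lower) / len(lower)
    return effects


def estimated_std_dev(effects):
    """Expected standard deviation from the seven factor effects."""
    return math.sqrt(2 / 7 * sum(e ** 2 for e in effects.values()))


effects = factor_effects(DESIGN, RESPONSES)
# Rank the factors by the absolute value of their effects.
for name, e in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"Factor {name}: {e:.2f}")
print(f"s = {estimated_std_dev(effects):.2f}")
```

Replacing `RESPONSES` with results from another ruggedness test that uses the same eight-run design gives the corresponding effects and standard deviation.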

Equivalency Testing

If an approved standard method is available, then a new method should be evaluated by comparing results to those obtained when using the standard method. Normally this comparison is made at a minimum of three concentrations of analyte to evaluate the new method over a wide dynamic range. Alternatively, we can plot the results obtained using the new method against results obtained using the approved standard method. A slope of 1.00 and a y-intercept of 0.0 provide evidence that the two methods are equivalent.
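The slope and y-intercept for such a comparison come from an ordinary unweighted least-squares fit, sketched below. The helper `fit_line` and the paired results are hypothetical and serve only to illustrate the calculation. A complete evaluation would also examine the confidence intervals for the slope and the y-intercept.

```python
def fit_line(x, y):
    """Unweighted least-squares slope and y-intercept for y versus x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired results (not from the text): the same samples analyzed
# by the approved standard method (x) and by the new method (y).
standard_method = [5.0, 10.2, 19.8, 40.1, 79.9]
new_method = [5.1, 10.0, 20.1, 39.8, 80.3]

slope, intercept = fit_line(standard_method, new_method)
print(f"slope = {slope:.3f}, y-intercept = {intercept:.3f}")
```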