The main problem when performing equivalence tests is the prospective (!) definition of the equivalence limits. First of all: this is a subject-matter question, not a statistical one. Nevertheless, setting the equivalence limits takes up a lot of space in statistical consulting. Wide limits lead to a smaller sample size, and the proof is easier to achieve. On the other hand, the validity of such a proof can be limited, and it may, for example, not be accepted by the regulatory authorities.
One can approach this question with the following considerations:
- Which difference is not relevant?
- "A difference that makes no difference."
- What is the minimally interesting difference (MID)? The equivalence range should be somewhat smaller, e.g. 0.7 times the MID.
- How large are the measurement uncertainty and the biological variability? Here, too, the equivalence range should be smaller.
In the area of bioequivalence studies, the limits are set by the authorities: "Decision in favor of bioequivalence will be accepted when the parametric confidence intervals do not exceed the limits of 80 and 125% for the ratio of AUC values and for the ratio of Cmax values. The decision procedure is based on 90% confidence intervals."
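This decision rule can be sketched in a few lines: compute a 90% confidence interval for the ratio on the log scale and check whether it lies entirely within 80–125%. The data are hypothetical, and a large-sample normal approximation stands in for the t distribution to keep the sketch stdlib-only.

```python
# Sketch: bioequivalence decision via a 90% confidence interval for the
# ratio, computed on the log scale. Hypothetical data; large-sample
# normal approximation instead of the t distribution.
import math
from statistics import NormalDist, mean, stdev

# Hypothetical per-subject ratios, e.g. AUC_test / AUC_reference
ratios = [0.95, 1.02, 0.98, 1.05, 0.97, 1.01, 0.99, 1.03, 0.96, 1.00]
log_ratios = [math.log(r) for r in ratios]

n = len(log_ratios)
m, s = mean(log_ratios), stdev(log_ratios)
z = NormalDist().inv_cdf(0.95)            # two-sided 90% interval
half_width = z * s / math.sqrt(n)
lo, hi = math.exp(m - half_width), math.exp(m + half_width)

# Bioequivalence is concluded only if the whole interval lies in 80-125%
bioequivalent = (lo >= 0.80) and (hi <= 1.25)
print(f"90% CI for the ratio: [{lo:.3f}, {hi:.3f}] -> {bioequivalent}")
```

Note that the conclusion requires the entire interval to stay inside the limits; a point estimate inside 80–125% is not sufficient.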
Method validation experiments are often about proving that a target parameter equals 0. For example, when comparing methods, the aim may be to show that the bias (the systematic error) of the test method relative to a comparison method is negligible, i.e. practically 0. Or, in a robustness or stability study, it must be shown that no relevant changes occur.
The frequently encountered reasoning "the test for difference does not yield a significant difference, so the groups are the same with regard to the examined feature" is incorrect from a statistical point of view. Such a result is an indication, but not a proof, because these are significance tests, with which rejection of the null hypothesis (which states equality) can be demonstrated, but not its acceptance.
If the aim of a project is to prove equivalence, the appropriate tools are the equivalence tests.
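The standard equivalence procedure is the TOST scheme ("two one-sided tests"): equivalence is concluded only if the difference is shown to be significantly above the lower limit AND significantly below the upper limit. The sketch below uses hypothetical paired differences and a hypothetical equivalence range of ±0.5 units, with a large-sample normal approximation in place of the t distribution.

```python
# Sketch of a TOST ("two one-sided tests") equivalence test for a mean
# difference. Data and the equivalence limit delta are hypothetical;
# the limit must be fixed prospectively, before seeing the data.
import math
from statistics import NormalDist, mean, stdev

diffs = [0.10, -0.05, 0.20, 0.00, -0.10, 0.15, 0.05, -0.15, 0.10, 0.00]
delta = 0.5                          # prospectively fixed equivalence limit

n = len(diffs)
m, se = mean(diffs), stdev(diffs) / math.sqrt(n)

z_lower = (m + delta) / se           # H0: true difference <= -delta
z_upper = (delta - m) / se           # H0: true difference >= +delta
p_lower = 1 - NormalDist().cdf(z_lower)
p_upper = 1 - NormalDist().cdf(z_upper)

# Equivalence is shown only if BOTH one-sided tests reject at alpha
alpha = 0.05
equivalent = max(p_lower, p_upper) < alpha
print(f"p_lower={p_lower:.4f}, p_upper={p_upper:.4f} -> {equivalent}")
```

Rejecting both one-sided null hypotheses at level alpha is the same decision as checking that the (1 − 2·alpha) confidence interval lies entirely within ±delta, which connects this test to the confidence-interval rule used in bioequivalence.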
While studies aimed at proving equivalence or non-inferiority are widespread in the pharmaceutical industry and have long been evaluated appropriately (since the 1990s the associated tests have been referred to as equivalence tests), the laboratory diagnostics community has struggled considerably to adopt the methodology. The first publication known to us [Lung KR, Gorko MA, Llewelyn J, Wiggins N. Statistical method for the determination of equivalence of automated test procedures. J Autom Methods Manag Chem 2003; 25: 123-7] found little resonance.
We have published corresponding procedures for carry-over studies, for demonstrating commutability, and for method comparison [Keller T, Brinkmann T (2014). Proposed guidance for carryover studies, based on elementary equivalence testing. Clin Lab 7: 1153-61; Keller T, Weber S (2009). Statistical Test for Equivalence in Analysis of Commutability Experiments. CCLM 47: 376-377; Keller T, Faye S, Katzorke T (2011). Statistical Test for Equivalence in Analysis of Method Comparison Experiments. Application in comparison of AMH assays. CCLM 49: 806].
In the meantime, the procedure is slowly finding its way into the community [Holland MD, Budd JR, et al. (2017). Improved statistical methods for evaluation of stability of in vitro diagnostic reagents. Stat Biopharm Res 9: 272-278], even if the test is not yet called an equivalence test in the context of commutability [Nilsson G, Budd JR, Greenberg N, Delatour V, Rej R, Panteghini M, Ceriotti F, Schimmel H, Weykamp C, Keller T, Camara JE, Burns C, Vesper HW, MacKenzie F, Miller WG (2018). IFCC Working Group Recommendations for Assessing Commutability Part 2: Using the Difference in Bias Between a Reference Material and Clinical Samples. Clin Chem 64: 455-464].
Figure: Carry-over as a non-inferiority problem. Figure from Keller T, Brinkmann T (2014). Proposed guidance for carryover studies, based on elementary equivalence testing. Clin Lab 7: 1153-61.