The Vindication of Magnitude-Based Inference Will G
Hopkins, Alan M Batterham Sportscience 22, 19-29, 2018
(sportsci.org/2018/mbivind.htm) |
||||||

The
changes from the draft version are as follows… ·
From
the comments of Little (2018) and Lakens (2018), and our response (Batterham and Hopkins, 2018), MBI can be described as ·
The
comment of Wilkinson
(2018) supports our interpretation of errors
and error rates in MBI. We see no need
to modify our article in this respect. ·
When
the editor of ·
The
comment of Buchheit
(2018) highlights the plight of a researcher
who understands and has opted for MBI, and for whom (in the absence of viable
alternatives) a return to p values is unthinkable. His comment represents an
endorsement of MBI by an experienced practitioner-researcher who suffered
under p values. No update of this article is required. ·
Some
researchers still need to understand how MBI works and how p values fail to
adequately represent uncertainty in effects. One of us (WGH) therefore put
together a slideshow and two videos, available via this comment (Hopkins, 2018). One of the slides is included below,
with an explanatory paragraph. ·
Soon
after publication of this version, someone raised the concern that MBI could
be viewed as promoting unethically underpowered studies. We have therefore
amended the paragraph
about small samples. Magnitude-based inference (MBI) is an approach to making a
decision about the true or population value of an effect statistic, taking
into account the uncertainty in the magnitude of the statistic provided by a
sample of the population. In response to concerns about error rates with the
decision process (Welsh
and Knight, 2015), we
recently showed that MBI is superior to the traditional approach to
inference, null-hypothesis significance testing (NHST) (Hopkins and Batterham, 2016).
Specifically, the error rates are comparable and often lower than those of
NHST, the publishability rates with small samples are higher, and the potential
for publication bias is negligible. A statistician from Stanford University, Kristin Sainani, has
now attempted to refute our claims about the superiority of MBI to NHST (Sainani, 2018). We
acknowledge the effort expended in her detailed scrutiny and welcome the
opportunity to discuss the points raised in the spirit of furthering
understanding. Sainani argues that MBI should not be used, and that we should
instead "adopt a fully Bayesian analysis" or merely interpret the
standard confidence interval as a plausible range of effect magnitudes
consistent with the data and model. We have no objection to researchers using
either of these two approaches, if they so wish. Nevertheless, we have shown
before and show here again that MBI is a valid, robust approach that has
earned its place in the statistical toolbox. The title of Sainani's critique refers to "the
problem" with magnitude-based inference (MBI), but in the abstract she
claims that there are several problems with the Type-I and Type-II error
rates. In the article itself, she begins her synopsis of MBI with another
apparent problem: that the probabilistic statements in MBI about the
magnitude of the true effect are invalid. Throughout the critique are
numerous inconsistencies and mistakes. We solve here all her perceived
problems, highlight her inconsistencies and correct her mistakes. Should researchers make probabilistic assertions about the true
(population) value of effects? Absolutely, especially for clinically
important effects, where implementation of a possibly beneficial effect in a
clinical or other applied setting carries with it the risk of harm. We use
the term Sainani states early on that "I
completely agree with and applaud" the approach of interpreting the
range of magnitudes of an effect represented by its upper and lower
confidence limit, when reaching a decision about a clinically important
effect. But, according to Sainani, "where Hopkins and Batterham's method
breaks down is when they go beyond simply making qualitative judgments like
this and advocate translating confidence intervals into probabilistic
statements, such as the effect of the supplement is 'very likely trivial' or
'likely beneficial.' This requires interpreting confidence intervals
incorrectly, as if they were Bayesian credible intervals." We have
addressed this concern previously (Hopkins and Batterham, 2016). The
usual confidence interval is congruent with a Bayesian credibility interval
with a minimally informative prior (Burton, 1994; Burton et al.,
1998; Spiegelhalter et al., 2004). As
such, it is an objective estimate of the likely range of the true value, and
the associated probabilistic statements of MBI are Bayesian posterior
probabilities with a minimally informative prior. The post-publication comments of
Little (2018) and
Lakens (2018) further
underscore this point. Unfortunately, full Bayesians disown us, because we prefer not
to turn belief into an informative subjective prior. Meanwhile, NHST-trained
statisticians disown us, because we do not test hypotheses. MBI is therefore
well placed to be a practical haven between a Bayesian rock and an NHST hard
place. (Others have attempted hybrids of Bayes and NHST, albeit with
different goals. See the technical notes.) From Bayesians we adapt valid
probabilistic statements about the true effect, based on the minimally
informative dispersed uniform prior. From frequentists (advocates of NHST) we
adapted straightforward computational methods and assumptions, and we
computed error rates for decisions based not on the null hypothesis but on
sufficiently low or high probabilities for the qualitative magnitude of the
true effect. The name There is a logical inconsistency in Sainani's "qualitative
judgment" of confidence intervals. In her view, it is not appropriate to
make a probabilistic assertion about the true magnitude of the effect, but it
is appropriate to interpret the magnitude of the lower and upper confidence
limits. The problem with this approach is that it all depends on the level of
the confidence interval, so she is in fact making a There is a further inconsistency with Sainani's applause for
qualitative judgments based on the confidence interval: the fact that her
concerns about error rates in MBI would apply to such judgments. Consider,
for example, a confidence interval that overlaps trivial and substantial
magnitudes. What is her qualitative judgment? The effect could be trivial or
substantial, of course. Where is the error in that pronouncement? If the true
effect is trivial, we say there is none, but she says there is an
unacceptable ill-defined Type-I error rate. The only way she can keep a
well-defined NHST Type-I error rate is to make a qualitative judgment Sainani is also inconsistent when she makes the following
statement: "Hopkins and Batterham's logic is that as long as you
acknowledge even a small chance (5-25%) that the effect might be trivial when
it is [truly trivial], then you haven't made a Type I error… But this seems
specious. Is concluding that an effect is 'likely' positive really an
error-free conclusion when the effect is in fact trivial?" Consider the
confidence-interval equivalent of Sainani's statement. A small chance that
the effect could be trivial corresponds to a confidence interval covering
mostly substantial values, with a slight overlap into trivial values, such
that the probability of a trivial true effect is only 6%, for example. Hence
we say the effect could be trivial, so no Type-I error occurs (Figure 1).
Now consider what happens in NHST. If the 95% confidence
interval overlaps the null only slightly, with p=0.06, then a Type-I error
has not occurred (Figure 1). In other words, it's the same kind of decision
process as for MBI, except that in MBI the null is replaced with the smallest
important effect. The same argument could be mounted for Type-II errors:
Sainani does not specifically call our logic here specious, but she does show
later that our definitions "wildly underestimate" the traditional
Type-II error rates. We will not be held accountable for error rates based on
the null hypothesis. Sainani offers a novel solution to her perceived problem with
the definition of MBI Type-I error: allow for "degrees of error",
which inevitably makes higher Type-I error rates. But a similar inflation of
error rates would occur with NHST, if degrees of error were assigned to p
values that approach significance. We doubt if her solution would solve the
problems of the p value that are increasingly voiced in the literature; in
any case, we do not see the need for it with MBI. When an effect is possibly
trivial and possibly substantially positive, that is what the researcher has
found: it's on the way to being substantially positive. Furthermore, for
effects with true values that are close to the smallest important effect, the
outcome with even very large sample sizes will usually be For researchers, reviewers and journal
editors who are still undecided about using MBI in preference to p values,
see Figure 2, which is taken from the slideshow and second video, available
via this comment. The figure shows outcomes with small
samples where the interpretation of the magnitude of the confidence
limits–that is, MBI–provides a succinct and accurate description of the
uncertainty in the magnitude of the true effect, whereas p values fail. For
the two outcomes where the conclusion with non-clinical MBI is
Turning now to the problem of error rates in MBI, we find some
agreement and some disagreement with Sainani about the definitions of error.
We consider that we made a breakthrough with our definitions, because they
focus on trivial and substantial magnitudes rather than the null. As we
stated in our Sports Medicine article (Hopkins and Batterham, 2016), a
valid head-to-head comparison of NHST and MBI requires definitions of Type-I
(false-positive) and Type-II (false-negative) error rates that can be applied
to both approaches. In the traditional definition of a Type-I error, a truly
null effect turns out to be statistically significant. Sample-size estimation
in NHST is all about getting significance for substantial effects, so we
argued that a Type-I error must also occur when any truly In her opening statement on definitions of error, Sainani states
that "Hopkins and Batterham are confused about what to call cases in
which there is a true non-trivial effect, but an inference is made in the
wrong direction (i.e., inferring that a beneficial effect is harmful or that
a harmful effect is beneficial). In the text, they switch between calling
these Type I and Type II errors." Yes, we may have caused confusion with
the following statement: "…implementation of a harmful effect
represents a more serious error than failure to implement a beneficial effect.
Although these two kinds of error are both false-negative type II errors,
they are analogous to the statistical type I and II errors of NHST, so they
are denoted as clinical type I and type II errors, respectively." They
are denoted as clinical Type-I and Type-II errors in the spreadsheet for
sample-size estimation at the Sportscience site, but they are correctly
identified as Type-II errors in our figure defining the errors, in the text
earlier in the article, and in the figures summarizing error rates. Sainani
goes on to state that "in their calculations, they treat them both as
Type II errors (Table 1a). But they can't both be Type II errors at the same
time." We do not understand this assertion, or her justification of it
involving one-tailed tests (but see the technical notes). She concludes with
"inferring that a beneficial effect is harmful is a Type II error,"
with which we agree, "whereas inferring that a harmful effect is
beneficial is a Type I error," with which we disagree. When a true
harmful effect is inferred not to be harmful, it is a Type-II error. Sainani
also notes that a true substantial effect inferred to be substantial of
opposite sign can be called a Type-III error, but we see no need for this
additional complication. That said, we do see the need to control the error
rate when truly harmful effects are inferred to be potentially beneficial.
Our rebuttal of Sainani's assertions about error rates might not
satisfy fundamentalist adherents of NHST. Figure 3 shows our original figure
from the Sports Medicine article and an enlargement of the Type-I rates. We
did not misrepresent these rates in the text, but arguably we presented them
in a manner that favored MBI: "For null and positive trivial values, the
type I rates for clinical MBI exceeded those for NHST for a sample size of
50+50 (~15–70 % versus ~5–40 %), while for the largest sample size, the type
I rates for clinical MBI (~2–75 %) were intermediate between those of
conservative NHST (~0.5–50 %) and conventional NHST (5–80 %)." These
error rates are consistent with those presented by Sainani, but the changes
of scale for the different true-effect magnitudes in her figure gives an
unfavorable impression of the MBI rates. We gave an honest account of the higher Type-I error rates with
odds-ratio MBI, which Sainani did not address. Our justification for keeping
this version of MBI in the statistical toolbox along with clinical MBI seems
reasonable. From the Sports Medicine article: "The Type-I rates for
clinical MBI were substantially higher than those for NHST for null and
positive true values with a sample size of 50+50. The probabilistic
inferences for the majority of these errors were only possibly beneficial, so
a clinician would make the decision to use a treatment based on the effect,
knowing that there was not a high probability of benefit. Type-I error rates
for odds-ratio MBI were the largest of all the inferential methods for null
and positive trivial effects, but for the most part these rates were due to
outcomes where the chance of benefit was rated unlikely or very unlikely, but
the risk of harm was so much lower that the odds ratio was >66. Inspection
of the confidence intervals for such effects would leave the clinician with
little expectation of benefit if the effect were implemented, so the high
Type-I error rates should not be regarded as a failing of this
approach." In her discussion, Sainani asserts: "Whereas standard
hypothesis testing has predictable Type I error rates, MBI has Type I error
rates that vary greatly depending on the sample size and choice of thresholds
for harm/benefit. This is problematic because unless researchers calculate
and report the Type I error for every application, this will always be hidden
to readers." But the "well-defined" Type-I rate for NHST is
only for the null; for trivial true effects it also varies widely with sample
size and choice of magnitude thresholds, and this variation is also hidden
from readers. The fact that the Type-I error rate for MBI peaks at the
optimum sample size (the minimum sample size for practically all outcomes to
be clear) is no cause for concern, because sample-size estimation in MBI is
based on controlling the Type-II rates. She goes on with this particularly
galling assertion: "Furthermore, the dependence on the thresholds for
harm/benefit makes it easy to game the system. A researcher could tweak these
values until they get an inference they like." This is a fatuous charge
to level against MBI. Any system of inference is open to abuse, if
researchers are so minded. A researcher who assesses the importance of
a statistically significant or non-significant outcome can choose the value
of the smallest important effect at that stage to suit the outcome obtained
with the sample. Researchers also game the NHST system by providing a justification
for sample size based on moderate effects. Sainani presumably has the same
concerns about full (subjective) Bayesians gaming not only the smallest
important effect but also the prior to get the most pleasing or publishable
outcome. Sainani's only remaining substantial concern about our
definition of error rates is not so easily dismissed. MBI provides a new
category of inferential outcome: Sainani concludes her critique with the following solution to
fix what she regards as the MBI Type-I error problem: "…a one-sided
null hypothesis test for benefit–interpreted alongside the corresponding
confidence interval–would achieve most of the objectives of clinical MBI
while properly controlling Type I error." We disagree. First, we do not
wish to conduct "tests" of any kind; we embrace uncertainty and
prefer estimation to "testimation", to borrow from Ziliak and McCloskey (2008). Secondly, the p value from her proposed
one-sided test against the non-zero null given by the minimum clinically
important difference is precisely equivalent to 1 minus the probability of
benefit from MBI. If the one-sided test is conducted at a conventional 5%
alpha level, the implication is that Sainani requires >95% chance of
benefit to declare a treatment effective–equivalent to our Before we leave the issue of error rates, it is important to
note that the theoretical basis of NHST is now held to be untrustworthy by
some highly cited establishment statisticians. Consider, for example, the
following comments of two contributors to the American Statistical
Association's policy statement on p values (Wasserstein and Lazar, 2016; see the supplement):
"we should advise today’s students of statistics that they should avoid
statistical significance testing (Ken Rothman)" and "hypothesis
testing as a concept is perhaps the root cause of the problem (Roderick
Little)." If they are right, it follows that the traditional definitions
of Type-I and Type-II errors, both of which are based on the null hypothesis,
are themselves unrealistic and untrustworthy. Our definitions deserve more
recognition as a possible way forward. In her criticisms of the theory of MBI, Sainani claims that the
three references we cited in our Sports Medicine
article to support the sound theoretical
basis of MBI "do not provide such
evidence." We will now show that her claim is misleading
or incorrect for all three references. The first reference is Gurrin et al. (2000), from
which she quotes correctly: "Although the use of a uniform prior
probability distribution provides a neat introduction to the Bayesian
process, there are a number of reasons why the uniform prior distribution
does not provide the foundation on which to base a bold new theory of
statistical analysis!" However, she neglects to point out that later in
the same article Gurrin et al. make this statement: "One of the problems
with Bayesian analysis is that it is often a non-trivial problem to combine
the prior information and the current data to produce the posterior
distribution… The congruence between conventional confidence intervals and
Bayesian credible intervals generated using a uniform prior distribution
does, however, provide a simple way to obtain inferences in Bayesian form
which can be implemented using standard software based on the results and
output of a conventional statistical analysis… Our approach [effectively MBI]
is straightforward to implement, offers the potential to describe the results
of conventional analyses in a manner that is more easily understood, and The second reference supporting MBI is Shakespeare et al. (2001).
Sainani states that this article "just provides general information on
confidence intervals, and does not address anything directly related to
MBI." On this point she is also wrong. The method presented by
Shakespeare et al. to derive what they refer to as "confidence
levels" uses precisely the same methods as MBI to derive the probability
of benefit beyond a threshold for the minimum clinically important difference.
For example, the authors present the following re-analysis of a previously
published study using their method: "The study found a survival benefit
of 28% favoring immediate nodal dissection (hazard ratio 0·72, 95% CI
0·49–1·04). There is a… 94% level of confidence [i.e., chance of benefit]
that the survival benefit is clinically relevant (improvement in survival of
3% or more). The third reference that she claims does not provide evidence
supporting MBI is our letter to the editor (Batterham and Hopkins, 2015) in
response to the article by Welsh and Knight (2015). By her
account, this reference "is a short letter in which they point to
empirical evidence from a simulation that I believe is a preliminary version
of the simulations reported in Sports Science [sic]." But the issue here
is the theoretical basis of MBI, which indeed we had argued succinctly in the
letter. Hence this claim also is wrong. Finally, the overarching negative tone of Sainani's critique
deserves attention. We counted three occasions in the article where she gives
any credit to our achievement with MBI, but each is immediately followed by
an assertion that we were misguided or mistaken. She is the one who is
misguided or mistaken. It is deeply disappointing and discouraging when
someone in her position of influence fails to notice or acknowledge the
following There is still room for debate that
could result in improvements in MBI. The most obvious debatable feature are
the rules we have devised for deciding when effects are clear in clinical and
non-clinical settings–in other words, the rules for acceptable uncertainty in
the two settings. These rules in turn depend on the threshold probabilities
that define the terms We have demonstrated that the error rates in MBI are acceptable overall. However, those wishing to use MBI, but who remain concerned with error rates, could present an additional statistic with excellent error control, the second-generation p-value (SGPV) (Blume et al., 2018). Briefly, this statistic is based on an interval null hypothesis equivalent to the trivial region in MBI. The SGPV is not a probability; rather it is the proportion of hypotheses supported by the data and model that are trivial. If the SGPV=0, then the data support only clinically meaningful hypotheses. If the SGPV=1, then the data support only trivial hypotheses. Values between 0 and 1 reflect the degree of support for clinically meaningful or trivial hypotheses, with a SGPV of 0.5 indicating that the data are strictly inconclusive. If the attack on MBI results in
journal editors banning the use of MBI in submitted manuscripts, and if the
editors do not accept MBI as The lack of substantial bias with
MBI should not be used as an excuse for performing underpowered studies. In
the simulations of controlled trials where the MBI-optimal sample size was 50
in each group, a sample size of 10 in each group resulted in ~55-65% unclear
non-clinical effects and ~20-65% unclear clinical effects over the range of
true trivial effects (Hopkins and Batterham, 2016). It is
unethical to undertake research when the expectation of a decisive outcome
for trivial effects is determined by a coin toss, but when an optimal sample
size for trivial effects is not possible, should the research should still be
performed? Yes, if there is a genuine expectation that the effect will have
sufficient magnitude to be clear, or if another cohort of participants can be
recruited eventually to make the sample size adequate (although the bias with
a group-sequential design in MBI has yet to be investigated). The smaller
sample sizes for publishability with MBI reduce the risk of unethically underpowered
studies compared with NHST. In conclusion, MBI represents a trustworthy mechanism for
representing the uncertainty in effects with well-defined qualitative
categories of probability. It beggars belief that any journal reviewer or
editor could take exception to publication of an effect as being Acknowledgments: Thanks to Steve Marshall, Ken Quarrie and Fabio Serpiello, who provided useful feedback on drafts. Thanks also to those who responded with the published comments, and to colleagues at Victoria University (Fabio Serpiello, Steph Blair, Luca Oppici, and Craig Pickett), who provided feedback on the slideshow/videos, and to Ken Quarrie, who provided the Sainani-style error graphics therein. ## Technical notes
Throughout
this article, Some
full Bayesians have previously taken exception to the non-informative or
"flat" prior of MBI, by invoking two arguments. First, representing
such a prior mathematically is an intractable problem (Barker and Schofield, 2008). We
delighted in parodying this argument by calling the flat prior an imaginary
Bayesian monster (Hopkins and
Batterham, 2010): the
argument is easily dismissed simply by making the prior minimally
informative, which makes the prior tractable but makes no substantial
difference to the posterior. The second argument is that a uniform flat or
minimally informative prior must become non-uniform, if the dependent
variable is transformed, for example using logarithms or any of the
transformations in generalized linear modeling (e.g., Gurrin et al., 2000). Again,
this argument is easily dismissed: the flat or minimally informative prior is
applied to the transformation of the dependent variable in a model that makes
least non-uniformity of the effect and error compared with any other
transformations (including non-transformation) and models. What happens to
the prior with these other transformations and models is irrelevant. Interestingly,
if we were full Bayesians, we might not be expected to concern ourselves with
error control, as some full Bayesians distinguish "beliefs" from
estimates of "true" values; for them, frequentist notions such as
Type-I errors do not exist (Ventz and Trippa, 2015). A full
Bayesian–with the caveat that more than 30 years ago there were already
46,656 kinds (Good, 1982)–might
say, for example, that "75% of the credible values exceed the minimum
clinically important threshold for benefit”, whereas the MBI exponent would
claim that "the probability that the true value of the treatment exceeds
the threshold for benefit is 75%; that is, the treatment is likely
beneficial." In MBI, adopting a least-informative prior and making
decisions based on a posterior distribution equivalent to the likelihood
arguably requires us to give due consideration to error control, which we
have done. The general notion of Bayesian inference with a model chosen to
yield inferences with good frequency properties has been described as
"Calibrated Bayes" (Little, 2011; Little, 2006). Other
attempts at reconciling Bayesian and frequentist paradigms include
"Constrained Optimal Bayesian" designs (Ventz and Trippa, 2015). Meanwhile, to make probabilistic statements, Sainani
recommends we adopt a full Bayesian analysis, in which there is no apparent
requirement for error control, while lambasting MBI for having higher error
rates in some scenarios. Her position once again is inconsistent. ## References
Cohen J (1994). The earth is round (p < .05). American Psychologist 49, 997-1003 Mastrandrea MD, Field CB, Stocker TF, Edenhofer O, Ebi KL, Frame DJ, Held H, Kriegler E, Mach KJ, Matschoss PR, Plattner G-K, Yohe GW, Zwiers FW (2010). Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties. Intergovernmental Panel on Climate Change (IPCC): https://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf Draft 2 published
14 May 2018. |