I have been following this methodology to implement Bayesian A/B testing with Python on a new search engine feature that helps users find products more accurately. I split my users across three groups: control (A), disabled (B), and enabled (C).
Users come from different age ranges, genders, and countries, so they are randomly sampled into those groups to minimise the variance related to these differences.
Here are some counts:
Now, the control and disabled variants are effectively identical, but I wanted a way to be confident in my A/B/C statistical validation: my theory was that A = B, so there should be no difference, improvement, or loss. The A/B test has been running for two weeks and I have:
At this sample size the prior does not seem to have much influence, but I am using a Beta distribution as the prior. When comparing A vs B, B vs C, and A vs C, I found that:
I don't feel like I can trust any of these results, because B is beating A where I would expect the model to be unable to tell the two apart (confidence < 90% at least), so I must have misunderstood something.
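For reference, the comparison boils down to something like the sketch below (placeholder counts and a flat Beta(1, 1) prior; the names and numbers are illustrative, not my real data):

```python
# Monte Carlo estimate of P(variant B beats variant A) under
# independent Beta-Binomial models with a flat Beta(1, 1) prior.
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=200_000):
    # Posterior for each conversion rate: Beta(1 + conversions, 1 + non-conversions)
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
    return (b > a).mean()

# Placeholder counts, not the real experiment's numbers.
print(prob_b_beats_a(conv_a=1_000, n_a=50_000, conv_b=1_050, n_b=50_000))
```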
Any help or explanation would be greatly appreciated.
I assume that you rely on the independence assumption, but with your sample sizes I tend to distrust it! There must be some subgroups in your data (you did not give many details), maybe country, age, or something else. The churn rate might vary across such subgroups, and if the distributions of those variables differ between your three treatment groups, that contributes to the differences you have observed. The standard errors computed from the binomial distribution will then be too small (unobserved heterogeneity). Some calculations with your data:
so your disabled group sits almost midway between control and enabled. One idea could be to treat control and disabled as two control groups; the difference between them is then driven purely by variance (including non-binomial variance), and that could be the basis for an alternative analysis. For now, here is a stored Google Scholar search for papers about analysis with two control groups. I will look into it ... but out of time now.
EDIT (after the question included more information): Since there are additional variables like age range, gender, and country, these can be controlled for, and doing so will help to control/estimate (binomial) overdispersion. One way to do it is a mixed effects logistic regression, which is discussed in many posts on this site, for instance:
How can I deal with overdispersion in a logistic (binomial) glm using R?,
Overdispersion in logistic regression,
difference between mixed effect logistic regression and logistic regression
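To make the suggestion concrete, below is a rough Python sketch of such a model using statsmodels (the linked posts work in R with lme4; the column names here are hypothetical):

```python
# Mixed-effects logistic regression: fixed effects for treatment group and
# demographics, plus a random intercept per country to absorb subgroup variation.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical file: one row per user with a 0/1 conversion outcome.
df = pd.read_csv("ab_test_users.csv")

model = BinomialBayesMixedGLM.from_formula(
    "converted ~ C(group) + C(age_range) + C(gender)",  # fixed effects
    {"country": "0 + C(country)"},                      # random intercepts by country
    df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```

If the group coefficients move toward zero once country/age/gender are in the model, the apparent A-vs-B difference is likely driven by subgroup imbalance rather than by the feature.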
Hypothesis Testing Explained in 4 Parts, by Yuzheng Sun, PhD
As data scientists, we are expected to understand Hypothesis Testing well, but often we don't. This is mainly because our textbooks blend two schools of thought – p-value and significance testing vs. hypothesis testing – inconsistently.
For example, some questions are not obvious unless you have thought through them before:
Are power and beta dependent on the null hypothesis?
Can we accept the null hypothesis? Why?
How does the MDE change with alpha, holding beta constant?
Why do we use the standard error in Hypothesis Testing rather than the standard deviation?
Why can’t we be specific about the alternative hypothesis so we can properly model it?
Why is the fundamental tradeoff of the Hypothesis Testing about mistake vs. discovery, not about alpha vs. beta?
Addressing this problem is not easy, because the topic of Hypothesis Testing is convoluted. In this article, we introduce 10 concepts incrementally, aided by visualizations and intuitive explanations. By the end, you will have clear answers to the questions above, understand these concepts at a first-principles level, and be able to explain them well to your stakeholders.
We break this article into four parts.
Part 1: Set up the question properly using core statistical concepts, and connect them to Hypothesis Testing, while striking a balance between technical correctness and simplicity. Specifically:
We emphasize a clear distinction between the standard deviation and the standard error, and why the latter is used in Hypothesis Testing.
We explain fully when you can “accept” a hypothesis, when you should say “failing to reject” instead of “accept”, and why.
Part 2: Introduce alpha, the type I error, and the critical value with the null hypothesis.
Part 3: Introduce beta, the type II error, and power with the alternative hypothesis.
Part 4: Introduce the minimum detectable effect and the relationship between the factors in power calculations, with a high-level summary and practical recommendations.
In Hypothesis Testing, we begin with a null hypothesis , which generally asserts that there is no difference between our treatment and control groups. Commonly, this is expressed as the difference in means between the treatment and control groups being zero.
The central limit theorem suggests an important property of this difference in means — given a sufficiently large sample size, the underlying distribution of this difference in means will approximate a normal distribution, regardless of the population's original distribution. There are two notes:
1. The distribution of the population for the treatment and control groups can vary, but the observed means (when you observe many samples and calculate many means) are always normally distributed with a large enough sample. Below is a chart, where n=10 and n=30 correspond to the underlying distributions of the sample means.
2. Pay attention to “the underlying distribution”. Standard deviation vs. standard error is a potentially confusing concept. Let’s clarify.
Let’s declare our null hypothesis as having no treatment effect. Then, to simplify, let’s propose the following normal distribution with a mean of 0 and a standard deviation of 1 as the range of possible outcomes with probabilities associated with this null hypothesis.
The language around population, sample, group, and estimators can get confusing. Again, to simplify, let’s forget that the null hypothesis is about the mean estimator, and declare that we can either observe the mean hypothesis once or many times. When we observe it many times, it forms a sample*, and our goal is to make decisions based on this sample.
* For technical folks, the observation is actually about a single sample, many samples are a group, and the difference in groups is the distribution we are talking about as the mean hypothesis. The red curve represents the distribution of the estimator of this difference, and then we can have another sample consisting of many observations of this estimator. In my simplified language, the red curve is the distribution of the estimator, and the blue curve with sample size is the repeated observations of it. If you have a better way to express these concepts without causing confusion, please suggest one.
This probability density function means that if there is one realization from this distribution, the realization can be anywhere on the x-axis, with the relative likelihood given by the y-axis.
If we draw multiple observations , they form a sample . Each observation in this sample follows the property of the underlying distribution – more likely to be close to 0, and equally likely to fall on either side – so positive and negative deviations tend to cancel out, and the mean of this sample is even more tightly centered around 0.
We use the standard error to represent the error of our “sample mean” .
The standard error = the standard deviation of the observed sample / √(sample size).
For a sample size of 30, the standard error is roughly 0.18. Compared with the underlying distribution, the distribution of the sample mean is much narrower.
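This number is easy to verify with a short simulation. The sketch below (assuming, as the article does later, a standard normal underlying distribution) draws many samples of size 30 and compares the spread of their means to the formula:

```python
# Simulate the sampling distribution of the mean for samples of size 30
# drawn from a standard normal, and compare its spread to 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n = 30
sample_means = rng.standard_normal((100_000, n)).mean(axis=1)

print(sample_means.std())  # empirical spread of the sample means, ~0.18
print(1 / np.sqrt(n))      # theoretical standard error, ~0.1826
```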
In Hypothesis Testing, we try to draw some conclusions – is there a treatment effect or not? – based on a sample. So when we talk about alpha and beta, which are the probabilities of type I and type II errors , we are talking about the probabilities based on the plot of sample means and standard error .
From Part 1, we stated that a null hypothesis is commonly expressed as the difference in means between the treatment and control groups being zero.
Without loss of generality*, let's assume the underlying distribution of our null hypothesis has mean 0 and standard deviation 1.
Then the sample mean of the null hypothesis is 0, with a standard error of 1/√n, where n is the sample size.
When the sample size is 30, this distribution has a standard error of ≈0.18 and looks like the chart below.
*: A note for the technical readers: The null hypothesis is about the difference in means, but here, without complicating things, we made the subtle change to just draw the distribution of this “estimator of this difference in means”. Everything below speaks to this “estimator”.
The reason we have the null hypothesis is that we want to make judgments, particularly whether a treatment effect exists. But in the world of probabilities, any observation and any sample mean can happen, with different probabilities. So we need a decision rule to help us quantify our risk of making mistakes.
The decision rule is: we set a threshold. When the sample mean is above the threshold, we reject the null hypothesis; when the sample mean is below the threshold, we accept the null hypothesis.
It's worth noting that you may have heard “we never accept a hypothesis, we just fail to reject a hypothesis” and been subconsciously confused by it. The deep reason is that modern textbooks inconsistently blend Fisher's significance testing and Neyman-Pearson's Hypothesis Testing definitions and ignore important caveats (ref). To clarify:
First of all, we can never “prove” a particular hypothesis given any observations, because there are infinitely many hypotheses that could be true (with different probabilities) given an observation. We will visualize this in Part 3.
Second, “accepting” a hypothesis does not mean that you believe in it, but only that you act as if it were true. So technically, there is no problem with “accepting” a hypothesis.
But, third, when we talk about p-values and confidence intervals, “accepting” the null hypothesis is at best confusing. The reason is that “the p-value above the threshold” just means we failed to reject the null hypothesis. In the strict Fisher’s p-value framework, there is no alternative hypothesis. While we have a clear criterion for rejecting the null hypothesis (p < alpha), we don't have a similar clear-cut criterion for "accepting" the null hypothesis based on beta.
So the dangers in calling “accepting a hypothesis” in the p-value setting are:
Many people misinterpret “accepting” the null hypothesis as “proving” the null hypothesis, which is wrong;
“Accepting the null hypothesis” is not rigorously defined, and doesn’t speak to the purpose of the test, which is about whether or not we reject the null hypothesis.
In this article, we will stay consistent within the Neyman-Pearson framework , where “accepting” a hypothesis is legal and necessary. Otherwise, we cannot draw any distributions without acting as if some hypothesis was true.
You don’t need to know the name Neyman-Pearson to understand anything, but pay attention to our language, as we choose our words very carefully to avoid mistakes and confusion.
So far, we have constructed a simple world of one hypothesis as the only truth, and a decision rule with two potential outcomes – one of the outcomes is “reject the null hypothesis when it is true” and the other outcome is “accept the null hypothesis when it is true”. The likelihoods of both outcomes come from the distribution where the null hypothesis is true.
Later, when we introduce the alternative hypothesis and MDE, we will gradually walk into the world of infinitely many alternative hypotheses and visualize why we cannot “prove” a hypothesis.
We save the distinction between the p-value/significance framework and Hypothesis Testing for another article, where you will get the full picture.
We’re able to construct a distribution of the sample mean for this null hypothesis using the standard error. Since we only have the null hypothesis as the truth of our universe, we can only make one type of mistake – falsely rejecting the null hypothesis when it is true. This is the type I error , and the probability is called alpha . Suppose we want alpha to be 5%. We can calculate the threshold required to make it happen. This threshold is called the critical value . Below is the chart we further constructed with our sample of 30.
In this chart, alpha is the blue area under the curve. The critical value is 0.3. If our sample mean is above 0.3, we reject the null hypothesis. We have a 5% chance of making the type I error.
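The critical value is easy to reproduce. A sketch, assuming the article's one-sided setup with a null sampling distribution of N(0, SE²) and SE = 1/√30:

```python
# Critical value: the point with exactly 5% of the null's sampling
# distribution to its right (one-sided test).
import numpy as np
from scipy.stats import norm

alpha, n = 0.05, 30
se = 1 / np.sqrt(n)
critical_value = norm.ppf(1 - alpha, loc=0, scale=se)
print(critical_value)  # ~0.30
```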
Type I error: Falsely rejecting the null hypothesis when the null hypothesis is true
Alpha: The probability of making a Type I error
Critical value: The threshold to determine whether the null hypothesis is to be rejected or not
You may have noticed in part 2 that we only spoke to Type I error – rejecting the null hypothesis when it is true. What about the Type II error – falsely accepting the null hypothesis when it is not true?
But it is weird to call “accepting” false unless we know the truth. So we need an alternative hypothesis which serves as the alternative truth.
There is an important concept that most textbooks fail to emphasize: you can have infinitely many alternative hypotheses for a given null hypothesis; we just choose one. None of them are more special or “real” than the others.
Let's visualize it with an example. Suppose we observed a sample mean of 0.51. What is the true alternative hypothesis?
With this visualization, you can see why we have “infinitely many alternative hypotheses” because, given the observation, there is an infinite number of alternative hypotheses (plus the null hypothesis) that can be true, each with different probabilities. Some are more likely than others, but all are possible.
Remember, alternative hypotheses are a theoretical construct. We choose one particular alternative hypothesis to calculate certain probabilities. By now, we should have more understanding of why we cannot “accept” the null hypothesis given an observation. We can't prove that the null hypothesis is true; we just fail to reject it given the observation and our pre-determined decision rule.
We will fully reconcile this idea of picking one alternative hypothesis out of the world of infinite possibilities when we talk about MDE. The idea of “accept” vs. “fail to reject” is deeper, and we won’t cover it fully in this article. We will do so when we have an article about the p-value and the confidence interval.
For the sake of simplicity and easy comparison, let's choose an alternative hypothesis with a mean of 0.5 and a standard deviation of 1. Again, with a sample size of 30, the standard error is ≈0.18. There are now two potential “truths” in our simple universe.
Remember from the null hypothesis, we want alpha to be 5% so the corresponding critical value is 0.30. We modify our rule as follows:
If the observation is above 0.30, we reject the null hypothesis and accept the alternative hypothesis ;
If the observation is below 0.30, we accept the null hypothesis and reject the alternative hypothesis .
With the introduction of the alternative hypothesis, the alternative “(hypothesized) truth”, we can call “accepting the null hypothesis and rejecting the alternative hypothesis” a mistake – the Type II error. We can also calculate the probability of this mistake. This is called beta, which is illustrated by the red area below.
From the visualization, we can see that beta is conditional on the alternative hypothesis and the critical value. Let’s elaborate on these two relationships one by one, very explicitly, as both of them are important.
First, let's visualize how beta changes with the mean of the alternative hypothesis by setting another alternative hypothesis where mean = 1 instead of 0.5.
Beta changes from 13.7% to 0.0%. Namely, beta is the probability of falsely rejecting a particular alternative hypothesis when we assume it is true. When we assume a different alternative hypothesis is true, we get a different beta. So strictly speaking, beta only speaks to the probability of falsely rejecting a particular alternative hypothesis when it is true . Nothing else. It's only under other conditions that “rejecting the alternative hypothesis” implies “accepting” the null hypothesis or “failing to accept the null hypothesis”. We will elaborate further when we talk about the p-value and the confidence interval in another article. But what we have covered so far is true and enough for understanding power.
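Both betas can be reproduced directly: beta is the mass of the alternative's sampling distribution that falls below the critical value. A sketch under the same one-sided setup:

```python
# Beta = P(sample mean falls below the critical value | alternative is true).
import numpy as np
from scipy.stats import norm

se = 1 / np.sqrt(30)
critical_value = norm.ppf(0.95, scale=se)  # ~0.30

for alt_mean in (0.5, 1.0):
    beta = norm.cdf(critical_value, loc=alt_mean, scale=se)
    print(alt_mean, round(beta, 3))  # ~0.137 for mean 0.5, ~0.000 for mean 1.0
```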
Second, there is a relationship between alpha and beta. Namely, given the null hypothesis and the alternative hypothesis, alpha would determine the critical value, and the critical value determines beta. This speaks to the tradeoff between mistake and discovery.
If we tolerate more alpha, we will have a smaller critical value, and for the same beta, we can detect a smaller alternative hypothesis
If we tolerate more beta, we can also detect a smaller alternative hypothesis.
In short, if we tolerate more mistakes (either Type I or Type II), we can detect a smaller true effect. Mistake vs. discovery is the fundamental tradeoff of Hypothesis Testing.
So tolerating more mistakes leads to more chance of discovery. This is the concept of MDE that we will elaborate on in part 4.
Finally, we’re ready to define power. Power is an important and fundamental topic in statistical testing, and we’ll explain the concept in three different ways.
First, the technical definition of power is 1 − β. It is the probability that, given a particular alternative hypothesis and given our null hypothesis, sample size, and decision rule (alpha = 0.05), we accept that alternative hypothesis. We visualize it as the yellow area below.
Second, power is really intuitive in its definition. A real-world example is trying to determine the most popular car manufacturer in the world. If I observe one car and see one brand, my observation is not very powerful. But if I observe a million cars, my observation is very powerful. Powerful tests mean that I have a high chance of detecting a true effect.
Third, to illustrate the two concepts concisely, let’s run a visualization by just changing the sample size from 30 to 100 and see how power increases from 86.3% to almost 100%.
As the graph shows, power increases with sample size . The reason is that the sampling distributions of both the null and the alternative hypothesis become narrower as the sample means get more accurate. Holding alpha constant, the narrower null distribution lowers the critical value, and the narrower alternative distribution puts more of its mass above that critical value, so we are less likely to make a type II error.
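The power numbers above follow from the same machinery. A sketch recomputing them for n = 30 and n = 100:

```python
# Power = P(sample mean exceeds the critical value | alternative is true),
# recomputed as the sample size grows from 30 to 100.
import numpy as np
from scipy.stats import norm

for n in (30, 100):
    se = 1 / np.sqrt(n)
    critical_value = norm.ppf(0.95, scale=se)
    power = 1 - norm.cdf(critical_value, loc=0.5, scale=se)
    print(n, round(critical_value, 2), round(power, 3))
# n=30:  critical value ~0.30, power ~0.863
# n=100: critical value ~0.16, power ~1.000
```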
Type II error: Failing to reject the null hypothesis when the alternative hypothesis is true
Beta: The probability of making a type II error
Power: The ability of the test to detect a true effect when it’s there
The relationship between MDE, the alternative hypothesis, and power
Now, we are ready to tackle the most nuanced definition of them all: Minimum detectable effect (MDE). First, let’s make the sample mean of the alternative hypothesis explicit on the graph with a red dotted line.
What if we keep the same sample size, but want power to be 80%? This is when we recall from the previous chapter that “alternative hypotheses are theoretical constructs”. We can pick a different alternative that corresponds to 80% power. After some calculation, we find that it is the alternative hypothesis with mean = 0.45 (keeping the standard deviation at 1).
This is where we reconcile the concept of “infinitely many alternative hypotheses” with the concept of the minimum detectable effect. Remember that in statistical testing, we want more power. The “minimum” in “minimum detectable effect” is the minimum value of the mean of the alternative hypothesis that would give us 80% power. Any alternative hypothesis with a mean to the right of the MDE gives us sufficient power.
In other words, there are indeed infinitely many alternative hypotheses to the right of this mean 0.45. The particular alternative hypothesis with a mean of 0.45 gives us the minimum value where power is sufficient. We call it the minimum detectable effect, or MDE.
Let's go through how we derived the MDE from the beginning:
We fixed the distribution of sample means of the null hypothesis and fixed the sample size, so we can draw the blue distribution.
For our decision rule, we require alpha to be 5%. We derived that the critical value must be 0.30 to make 5% alpha happen.
We fixed the alternative hypothesis to be normally distributed with a standard deviation of 1, so the standard error is 0.18; the mean can be anywhere, as there are infinitely many alternative hypotheses.
For our decision rule, we require beta to be 20% or less, so our power is 80% or more.
We derived that the minimum value of the mean of the alternative hypothesis that we can detect with our decision rule is 0.45. Any value above 0.45 gives us sufficient power; a compact computation follows below.
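Putting the derivation into code (a sketch, assuming the one-sided test and known standard deviation of 1 used throughout the article):

```python
# MDE: the smallest alternative mean that achieves the target power at the
# given alpha, i.e. (z_{1-alpha} + z_{power}) * SE for a one-sided test.
import numpy as np
from scipy.stats import norm

def mde(n, alpha=0.05, power=0.80, sd=1.0):
    se = sd / np.sqrt(n)
    return (norm.ppf(1 - alpha) + norm.ppf(power)) * se

print(round(mde(30), 2))   # ~0.45
print(round(mde(100), 2))  # ~0.25 (foreshadowing the larger-sample case below)
```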
Now, let’s tie everything together by increasing the sample size, holding alpha and beta constant, and see how MDE changes.
Narrower distribution of the sample mean + holding alpha constant -> smaller critical value from 0.3 to 0.16
+ holding beta constant -> MDE decreases from 0.45 to 0.25
This is the other key takeaway: the larger the sample size, the smaller the effect we can detect, and the smaller the MDE.
This is a critical takeaway for statistical testing. It suggests that even for companies without large sample sizes, if the treatment effect is large, A/B testing can reliably detect it.
Let’s review all the concepts together.
Assuming the null hypothesis is correct:
Alpha: When the null hypothesis is true, the probability of rejecting it
Critical value: The threshold to determine rejecting vs. accepting the null hypothesis
Assuming an alternative hypothesis is correct:
Beta: When the alternative hypothesis is true, the probability of rejecting it
Power: The chance that a real effect will produce significant results
Power calculation:
Minimum detectable effect (MDE): Given sample sizes and distributions, the minimum mean of alternative distribution that would give us the desired alpha and sufficient power (usually alpha = 0.05 and power >= 0.8)
Relationship among the factors, all else equal: Larger sample, more power; Larger sample, smaller MDE
Everything we talk about is under the Neyman-Pearson framework. There is no need to mention the p-value and significance under this framework. Blending the two frameworks is the inconsistency brought by our textbooks. Clarifying the inconsistency and correctly blending them are topics for another day.
That’s it. But it’s only the beginning. In practice, there are many crafts in using power well, for example:
Why peeking introduces a behavior bias, and how to use sequential testing to correct it
Why having multiple comparisons affects alpha, and how to use Bonferroni correction
The relationship between sample size, duration of the experiment, and allocation of the experiment
Treating your allocation as a resource for experimentation: understanding when interaction effects are okay and when they are not, and how to use layers to manage them
Practical considerations for setting an MDE
Also, in the above examples, we fixed the distribution, but in reality, the variance of the distribution plays an important role. There are different ways of calculating the variance and different ways to reduce it, such as CUPED or stratified sampling.
Related resources:
How to calculate power with an uneven split of sample size: https://blog.statsig.com/calculating-sample-sizes-for-a-b-tests-7854d56c2646
Real-life applications: https://blog.statsig.com/you-dont-need-large-sample-sizes-to-run-a-b-tests-6044823e9992
This article proposes objective Bayesian multiple testing procedures for a normal model. The challenging task of considering all the configurations of true and false null hypotheses is addressed here by ordering the null hypotheses based on their Bayes factors. This approach reduces the size of the compared models for posterior search from \(2^k\) to \(k+1\), for \(k\) null hypotheses. Furthermore, the consistency of the proposed multiple testing procedures is established and their behavior is analyzed with simulated and real examples. In addition, the proposed procedures are compared with classical and Bayesian multiple testing procedures in all the possible configurations of true and false ordered null hypotheses.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abramovich, F., & Angelini, C. (2006). Bayesian maximum a posteriori multiple testing procedure. Sankhya, 68, 436–460.
Abramovich, F., Grinshtein, V., & Pensky, M. (2007). On optimality of Bayesian testimation in the normal means problem. The Annals of Statistics, 35, 2261–2286.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91, 109–122.
Casella, G., & Moreno, E. (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 101, 157–167.
Dunnett, C., & Tamhane, A. C. (1991). Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts. Statistics in Medicine, 11, 1057–1063.
Dunnett, C., & Tamhane, A. C. (1992). A step-up multiple test procedure. Journal of the American Statistical Association, 87, 162–170.
Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics, 31, 366–378.
Efron, B., Tibshirani, R., Storey, J. D., & Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151–1160.
Finner, H. (1993). On a monotonicity problem in step-down multiple test procedures. Journal of the American Statistical Association, 88, 920–923.
Girón, F. J., Martínez, M. L., Moreno, E., & Torres, F. (2006). Objective testing procedures in linear models. Scandinavian Journal of Statistics, 33, 765–784.
Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. Wiley.
Holm, S. (1999). Multiple confidence sets based on stagewise tests. Journal of the American Statistical Association, 94, 489–495.
Hsu, J. C. (1996). Multiple comparison: Theory and methods. Chapman & Hall/CRC.
Kang, S. G., Lee, W. D., & Kim, Y. (2022). Objective Bayesian variable selection in the linear regression model. Journal of Statistical Computation and Simulation, 92, 1133–1157.
Liu, W. (1997). Stepwise tests when the test statistics are independent. The Australian Journal of Statistics, 39, 169–177.
Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93, 1451–1460.
Morris, C. M. (1987). Discussion on ‘Reconciling Bayesian and frequentist evidence in the one-sided testing problem’ (by Casella and Berger). Journal of the American Statistical Association, 82, 106–111.
Romano, A. (1977). Applied statistics for science and industry. Allyn and Bacon.
Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. The Annals of Statistics, 30, 239–257.
Sarkar, S. K., & Chen, J. (2004). A Bayesian stepwise multiple testing procedure. Technical Report, Temple University.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479–498.
Tamhane, A. C., Liu, W., & Dunnett, C. W. (1998). A generalized step-up-down multiple test procedure. The Canadian Journal of Statistics, 26, 55–68.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley.
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00240494) and the Learning & Academic Research Institution for Master's and PhD Students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (RS-2023-00301914).
Authors and Affiliations
Department of Computer and Data Information, Sangji University, Wonju, 26339, Korea
Sang Gil Kang
Department of Statistics, Kyungpook National University, Daegu, 41566, Korea
KNU G-LAMP Research Center, Institute of Basic Sciences, Kyungpook National University, Daegu, 41566, Korea
Yongku Kim
Correspondence to Yongku Kim.
Conflict of interest
We have no conflicts of interest to disclose.
Consider the models \(M_{i1}\),
and the models \(M_{i2}\),
where \(c_{i1}\) and \(c_{i2}\) are arbitrary positive constants, \(\varvec{\theta }_{i1}=\tau _i\) and \(\varvec{\theta }_{i2}=(\mu _i,\sigma _i^2)\). For the minimal training sample vector \(\mathbf{z}\), we have
Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{i2}\) becomes
Hence Lemma 1 is proved.
We start by giving a brief summary of the intrinsic Bayes factor and the intrinsic methodology. The key idea is to split the sample \(\mathbf{x}\) into two parts, \(\mathbf{x}=(x(l), x(n-l))\). The part \(x(l)=(x_1,\cdots ,x_l)\), the training sample, is devoted to converting the improper prior \(\pi _i^N(\theta _i)\) to a proper distribution
\[ \pi _i(\theta _i \vert x(l)) = \frac{f_i(x(l) \vert \theta _i)\, \pi _i^N(\theta _i)}{m_i^N(x(l))}, \]
where \(m_i^N(x(l))=\int f_i(x(l) \vert \theta _i)\, \pi _i^N(\theta _i)\, d\theta _i,\ i=1,2\). The Bayes factor is then computed using the remaining data \(x(n-l)\) with \(\pi _i(\theta _i \vert x(l))\) as the prior distribution. The resulting partial Bayes factor is
\[ B_{21}(x(n-l) \vert x(l)) = \frac{m_2^N(\mathbf{x})\, m_1^N(x(l))}{m_1^N(\mathbf{x})\, m_2^N(x(l))}. \]
\(B_{21}(x(n-l) \vert x(l))\) is well defined only if \(x(l)\) is such that \(0< m_i^N(x(l))<\infty,\ i=1,2\). If there is no subsample of \(x(l)\) for which the second inequality holds, \(x(l)\) is called a minimal training sample.
The partial Bayes factor depends on the specific training sample \(x(l)\). To avoid the difficulty of choosing \(x(l)\), Berger and Pericchi (1996) proposed the use of a minimal training sample to compute \(B_{21}(x(n-l) \vert x(l))\), followed by an average over all the possible minimal training samples contained in the sample. This gives the arithmetic intrinsic Bayes factor (AIBF) of \(M_2\) against \(M_1\) as
\[ B_{21}^{AI}(\mathbf{x}) = B_{21}^{N}(\mathbf{x})\, \frac{1}{L} \sum _{l=1}^{L} B_{12}^{N}(x(l)), \]
where \(L\) is the number of minimal training samples \(x(l)\) contained in \(\mathbf{x}\), and \(B_{ij}^{N}\) denotes the Bayes factor computed directly from the improper priors.
Note that the intrinsic Bayes factor is not an actual Bayes factor. Further, the stability of the AIBF is a matter of concern: conceivably, for a given sample \(\mathbf{x}\), the number of minimal training samples might be small, and minor changes in the data could cause this number to vary substantially. Moreover, the equality \(B_{21}^{AI}(\mathbf{x})=1/B_{12}^{AI}(\mathbf{x})\) is not necessarily satisfied, so the coherence equality does not hold.
To be coherent, it is important to know whether \(B_{21}^{AI}(\mathbf{x})\) corresponds to an actual Bayes factor for a sensible prior; if so, consistency of \(B_{21}^{AI}(\mathbf{x})\) is assured. With the so-called intrinsic priors, this question was answered asymptotically by Berger and Pericchi (1996): there are priors \(\pi _1^I(\theta _1)\) and \(\pi _2^I(\theta _2)\) for which the corresponding Bayes factor
and \(B_{21}^{AI}(\mathbf{x})\) are asymptotically equivalent under the two models \(M_1\) and \(M_2\). Note that if we use intrinsic priors for computing the Bayes factor, instead of the improper priors we started from, coherence is assured.
Berger and Pericchi ( 1996 ) showed that intrinsic priors satisfy the functional equations
The expectations in these equations are taken with respect to \(f(x(l)\vert \theta _1)\) and \(f(x(l)\vert \theta _2)\), respectively; \(\psi _2(\theta _1)\) denotes the limit of the maximum likelihood estimator \(\hat{\theta }_2(\mathbf{x})\) under model \(M_1\) at the point \(\theta _1\), and \(\psi _1(\theta _2)\) denotes the limit of the maximum likelihood estimator \(\hat{\theta }_1(\mathbf{x})\) under model \(M_2\) at the point \(\theta _2\). The main difficulty with intrinsic priors is that they might not be unique (Moreno, 1997). In nested models, Eq. (21) reduces to a single equation with two unknown functions, so it is apparent that the solution is not unique. A procedure for choosing a specific intrinsic prior was given in Moreno et al. (1998).
We first compute the Bayes factor for comparing model \(M_{i2}\) versus \(M_{i1}\) with the intrinsic prior \(\pi ^I(\varvec{\theta }_{i2})\) . Now
Integrating with respect to \(\tau _i\) in (22), we obtain
Next we have
Integrating with respect to \(\mu _i\) in (24), we get
where \(s_i^2=\sum _{j=1}^{n_i} (x_{ij}-\bar{x}_i)^2\). Let \(w=\sigma _i^2\) and \(z=\tau _i^2/\sigma _i^2\). Integrating with respect to \(w\) in (25), we get
Hence Lemma 2 is proved.
Consider the models \(M_{(i)}\),
and the models \(M_{(k)}\),
where \(c_{(i)}\) and \(c_{(k)}\) are arbitrary positive constants, \(\varvec{\theta }_{(i)}=(\delta _{(1)},\cdots ,\delta _{(i)},\tau _{(1)},\cdots ,\tau _{(k)})\) and \(\varvec{\theta }_{(k)}=(\mu _{(1)},\cdots ,\mu _{(k)},\sigma _{(1)},\cdots ,\sigma _{(k)})\). For the minimal training sample vector \(\mathbf{z}\), we have
Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{(k)}\) becomes
Hence Theorem 1 is proved.
Consider the models \(M_{(0)}\),
and the models \(M_{(i)}\),
where \(c_{(0)}\) and \(c_{(i)}\) are arbitrary positive constants, \(\varvec{\theta }_{(0)}=(\tau _{(1)},\cdots ,\tau _{(k)})\) and \(\varvec{\theta }_{(i)}=(\mu _{(1)},\cdots ,\mu _{(i)},\sigma _{(1)},\cdots ,\sigma _{(k)})\). For the minimal training sample vector \(\mathbf{z}\), we have
Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{(i)}\) becomes
Hence Theorem 3 is proved.
Kang, S.G., Kim, Y. Objective Bayesian multiple testing for k normal populations. J. Korean Stat. Soc. (2024). https://doi.org/10.1007/s42952-024-00281-4
Received: 15 February 2024
Accepted: 07 July 2024
Published: 29 July 2024