Statistics in Brief: The Importance of Sample Size in the Planning and Interpretation of Medical Research
David Jean Biau, MD; Solen Kernéis, MD; Raphaël Porcher, PhD
Received 2007 Nov 1; Accepted 2008 May 22; Issue date 2008 Sep.
The increasing volume of research by the medical community often leads to increasing numbers of contradictory findings and conclusions. Although the differences observed may represent true differences, the results also may differ because of sampling variability as all studies are performed on a limited number of specimens or patients. When planning a study reporting differences among groups of patients or describing some variable in a single group, sample size should be considered because it allows the researcher to control for the risk of reporting a false-negative finding (Type II error) or to estimate the precision his or her experiment will yield. Equally important, readers of medical journals should understand sample size because such understanding is essential to interpret the relevance of a finding with regard to their own patients. At the time of planning, the investigator must establish (1) a justifiable level of statistical significance, (2) the chances of detecting a difference of given magnitude between the groups compared, ie, the power, (3) this targeted difference (ie, effect size), and (4) the variability of the data (for quantitative data). We believe correct planning of experiments is an ethical issue of concern to the entire community.
Introduction
“Statistical analysis allows us to put limits on our uncertainty, but not to prove anything.”— Douglas G. Altman [ 1 ]
The growing need for medical practice based on evidence has generated an increasing medical literature supported by statistics: readers expect and presume that medical journals publish only studies with unquestionable results they can use in their everyday practice, and editors expect, and often request, that authors provide rigorously supportable answers. Researchers submit articles based on presumably valid outcome measures, analyses, and conclusions claiming or implying the superiority of one treatment over another, the usefulness of a new diagnostic test, or the prognostic value of some sign. Paradoxically, the increasing frequency of seemingly contradictory results may be generating increasing skepticism in the medical community.
One fundamental reason for this conundrum takes root in the theory of hypothesis testing developed by Neyman and Pearson in the late 1920s [24, 25]. The majority of medical research is presented in the form of a comparison, the most obvious being treatment comparisons in randomized controlled trials. To assess whether the difference observed is likely attributable to chance alone or to a true difference, researchers set a null hypothesis that there is no difference between the alternative treatments. They then determine the probability (the p value) that they could have obtained the observed difference, or a larger one, if the null hypothesis were true; if this probability is below some predetermined explicit significance level, the null hypothesis (ie, there is no difference) is rejected. However, regardless of study results, there is always a chance to conclude there is a difference when in fact there is not (Type I error or false positive) or to report there is no difference when a true difference does exist (Type II error or false negative) and the study has simply failed to detect it (Table 1). The size of the sample studied is a major determinant of the risk of reporting false-negative findings. Therefore, sample size is important for planning and interpreting medical research.
Table 1. Type I and Type II errors during hypothesis testing

|  | Null hypothesis true (no true difference) | Null hypothesis false (true difference exists) |
| --- | --- | --- |
| Null hypothesis rejected | Type I error (false positive), probability α | Correct conclusion (power = 1 − β) |
| Null hypothesis not rejected | Correct conclusion | Type II error (false negative), probability β |
For that reason, we believe readers should be adequately informed of the frequent issues related to sample size, such as (1) the desired level of statistical significance, (2) the chances of detecting a difference of given magnitude between the groups compared, ie, the power, (3) this targeted difference, and (4) the variability of the data (for quantitative data). We will illustrate these matters with a comparison between two treatments in a surgical randomized controlled trial. The use of sample size also will be presented in other common areas of statistics, such as estimation and regression analyses.
Desired Level of Significance
The level of statistical significance α corresponds to the probability of Type I error, namely, the probability of rejecting the null hypothesis of "no difference between the treatments compared" when in fact it is true. The decision to reject the null hypothesis is based on a comparison of the test procedure's p value with the prespecified, arbitrarily chosen significance level. Controlling for Type I error is paramount to medical research to avoid the spread of new, or perpetuation of old, treatments that are ineffective. For the majority of hypothesis tests, the level of significance is arbitrarily chosen at 5%. When an investigator chooses α = 5%, if the test procedure's computed p value is less than 5%, the null hypothesis will be rejected and the treatments compared will be assumed to be different.
To reduce the probability of Type I error, we may choose to reduce the level of statistical significance to 1% or less [ 29 ]. However, the level of statistical significance also influences the sample size calculation: the lower the chosen level of statistical significance, the larger the sample size will be, considering all other parameters remain the same (see example below and Appendix 1). Consequently, there are domains where higher levels of statistical significance are used so that the sample size remains restricted, such as for randomized Phase II screening designs in cancer [ 26 ]. We believe the choice of a significance level greater than 5% should be restricted to particular cases.
Power

The power of a test is defined as 1 − the probability of Type II error. The Type II error consists of concluding there is no difference (the null hypothesis is not rejected) when in fact there is a difference; its probability is named β. Therefore, the power of a study reflects the probability of detecting a difference when this difference exists. It is also very important to medical research that studies be planned with adequate power so that meaningful conclusions can be issued if no statistical difference has been shown between the treatments compared. More power means less risk of Type II error and more chance of detecting a difference when it exists.

Power should be determined a priori to be at least 80% and preferably 90%. The latter means that, if the true difference between treatments is equal to the one we planned for, there is only a 10% chance the study will not detect it. Sample size increases with increasing power (Fig. 1).
The graphs show the distribution of the test statistic (z-test) for the null hypothesis (plain line) and the alternative hypothesis (dotted line) for a sample size of ( A ) 32 patients per group, ( B ) 64 patients per group, and ( C ) 85 patients per group. For a difference in mean of 10, a standard deviation of 20, and a significance level α of 5%, the power (shaded area) increases from ( A ) 50%, to ( B ) 80%, and ( C ) 90%. It can be seen, as power increases, the test statistics yielded under the alternative hypothesis (there is a difference in the two comparison groups) are more likely to be greater than the critical value 1.96.
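To make the relation between sample size and power concrete, here is a minimal sketch, not taken from the article, that reproduces the power values quoted for Fig. 1 under a z-test approximation; the function name and the use of scipy are our assumptions.

```python
from scipy.stats import norm

def approx_power(n, delta=10.0, sd=20.0, alpha=0.05):
    # Approximate power of a two-sided two-sample z test with n patients per group.
    z_crit = norm.ppf(1 - alpha / 2)         # critical value, 1.96 for alpha = 5%
    se = sd * (2.0 / n) ** 0.5               # standard error of the difference in means
    return norm.cdf(delta / se - z_crit)     # chance the test statistic exceeds the critical value

for n in (32, 64, 85):
    print(n, round(approx_power(n), 2))      # ~0.52, 0.81, 0.90 (about 50%, 80%, 90%)
```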
Very commonly, power calculations have not been performed before conducting the trial [3, 8], and when facing nonsignificant results, investigators sometimes compute post hoc power analyses, also called observed power. For this purpose, investigators use the observed difference and variability and the sample size of the trial to determine the power they would have had to detect this particular difference. However, post hoc power analyses have little statistical meaning, for three reasons [9, 13]. First, because there is a one-to-one relationship between p values and post hoc power, the latter does not convey any more information about the sample than the former. Second, nonsignificant p values always correspond to low power, and post hoc power, at best, will be slightly larger than 50% for p values equal to or greater than 0.05. Third, when computing post hoc power, investigators implicitly assume that the difference observed is clinically meaningful and more representative of the truth than the null hypothesis they precisely were not able to reject. However, in the theory of hypothesis testing, the difference observed should be used only to choose between the hypotheses stated a priori; a posteriori, the use of confidence intervals is preferable to judge the relevance of a finding. The confidence interval represents the range of values we can be confident, to some extent, includes the true difference. It is related directly to sample size and conveys more information than p values. Nonetheless, post hoc power analyses educate readers about the importance of considering sample size by explicitly raising the issue.
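The one-to-one correspondence between p values and post hoc power can be made explicit. A hypothetical sketch under the same z-test approximation (the function name is ours):

```python
from scipy.stats import norm

def post_hoc_power(p_value, alpha=0.05):
    # Observed power implied by a two-sided p value: recover |z| from p,
    # then ask how often that observed effect would clear the critical value.
    z_obs = norm.ppf(1 - p_value / 2)
    return norm.cdf(z_obs - norm.ppf(1 - alpha / 2))

for p in (0.05, 0.20, 0.50):
    print(p, round(post_hoc_power(p), 2))   # ~0.50, 0.25, 0.10
```

At p = 0.05 exactly, the observed power is 50%, illustrating why a nonsignificant result always "looks" underpowered in a post hoc calculation.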
The Targeted Difference Between the Alternative Treatments
The targeted difference between the alternative treatments is determined a priori by the investigator, typically based on preliminary data. The larger the expected difference is, the smaller the required sample size will be. However, because the sample size based on the difference expected may be too large to achieve, investigators sometimes choose to power their trial to detect a difference larger than one would normally expect, to reduce the sample size and minimize the time and resources dedicated to the trial. However, if the targeted difference between the alternative treatments is larger than the true difference, the trial may fail to demonstrate a difference between the two treatments when a smaller, and still meaningful, difference exists. This smallest meaningful difference sometimes is expressed as the "minimal clinically important difference," namely, "the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive costs, a change in the patient's management" [15]. Because theoretically the minimal clinically important difference is a multidimensional phenomenon that encompasses a wide range of complex issues of a particular treatment in a unique setting, it usually is determined by consensus among clinicians with expertise in the domain. When the measure of treatment effect is based on a score, researchers may use empiric definitions of clinically meaningful difference. For instance, Michener et al. [21], in a prospective study of 63 patients with various shoulder abnormalities, determined the minimal change perceived as clinically meaningful by the patients for the patient self-report section of the American Shoulder and Elbow Surgeons Standardized Shoulder Assessment Form was 6.7 points of 100 points. Similarly, Bijur et al. [5], in a prospective cohort study of 108 adults presenting to the emergency department with acute pain, determined the minimal change perceived as clinically meaningful by patients for acute pain measured on the visual analog scale was 1.4 points. There is no reason to try to detect a difference below the minimal clinically important difference because, even if it proves statistically significant, it will not be meaningful.
The minimal clinically important difference should not be confused with the effect size. The effect size is a dimensionless measure of the magnitude of a relation between two or more variables, such as Cohen's d standardized difference [6], but also the odds ratio, Pearson's r correlation coefficient, etc. Sometimes studies are planned to detect a particular effect size instead of being planned to detect a particular difference between the two treatments. According to Cohen [6], 0.2 is indicative of a small effect, 0.5 a medium effect, and 0.8 a large effect size. One of the advantages of doing so is that researchers do not have to make any assumptions regarding the minimal clinically important difference or the expected variability of the data.
The Variability of the Data
For quantitative data, researchers also need to determine the expected variability of the alternative treatments: the more variability expected in the specified outcome, the more difficult it will be to differentiate between treatments and the larger the required sample size (see example below). If this variability is underestimated at the time of planning, the sample size computed will be too small and the study will have less power than desired. For comparing proportions, the calculation of sample size makes use of the expected proportion with the specified outcome in each group. For survival data, the calculation of sample size is based on the survival proportions in each treatment group at a specified time and on the total number of events in the group in which the fewer events occur. Therefore, for the latter two types of data, variability does not appear in the computation of sample size.
Presume an investigator wants to compare the postoperative Harris hip score [12] at 3 months in a group of patients undergoing minimally invasive THA with a control group of patients undergoing standard THA in a randomized controlled trial. The investigator must (1) establish a statistical significance level, eg, α = 5%, (2) select a power, eg, 1 − β = 90%, and (3) establish a targeted difference in the mean scores, eg, 10, and assume a standard deviation of the scores, eg, 20 in both groups (which they can obtain from the literature or their previous patients). In this case, the sample size should be 85 patients per group (Appendix 1). If fewer patients are included in the trial, the probability of detecting the targeted difference when it exists will decrease; for sample sizes of 64 and 32 per group, for instance, the power decreases to 80% and 50%, respectively (Fig. 1). If the investigator assumed the standard deviation of the scores in each group to be 30 instead of 20, a sample size of 190 per group would be necessary to obtain a power of 90% with a significance level α = 5% and a targeted difference in the mean scores of 10. If the significance level were chosen at α = 1% instead of α = 5%, to yield the same power of 90% with a targeted difference in scores of 10 and a standard deviation of 20, the sample size would increase from 85 patients per group to 120 patients per group. In relatively simple cases, statistical tables [19] and dedicated software available on the internet may be used to determine sample size. In most orthopaedic clinical trials, the sample size calculation is rather simple, as above, but it becomes more complex in other cases. The type of end points, the number of groups, the statistical tests used, whether the observations are paired, and other factors influence the complexity of the calculation, and in these cases, expert statistical advice is recommended.
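As a check on these numbers, the following sketch (ours, not the authors'; it assumes the normal-deviate formula of Appendix 1 and rounds up) reproduces the sample sizes quoted above:

```python
import math
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    # Patients per group for a two-sided comparison of two means (Appendix 1 formula).
    z_a = norm.ppf(1 - alpha / 2)           # deviate for the significance level
    z_b = norm.ppf(power)                   # deviate for the power
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

print(n_per_group(10, 20))                  # -> 85 patients per group
print(n_per_group(10, 30))                  # -> 190 (larger SD, larger sample)
print(n_per_group(10, 20, alpha=0.01))      # -> 120 (stricter alpha, larger sample)
```

Because only the ratio delta/sd enters the formula, calling n_per_group(0.5, 1.0) with a standardized effect size (eg, Cohen's d = 0.5) returns the same 85 per group.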
Sample Size, Estimation, and Regression
Sample size was presented above in the context of hypothesis testing. However, it is also of interest in other areas of biostatistics, such as estimation or regression. When planning an experiment, researchers should ensure the precision of the anticipated estimation will be adequate. The precision of an estimation corresponds to the width of the confidence interval: the larger the tested sample size, the better the precision. For instance, Handl et al. [11], in a biomechanical study of 21 fresh-frozen cadavers, reported a mean ultimate load failure of four-strand hamstring tendon constructs of 4546 N under loading, with a standard deviation of 1500 N. Based on these values, if we were to design an experiment to assess the ultimate load failure of a particular construct, the precision around the mean at the 95% confidence level would be expected to be 3725 N for five specimens, 2146 N for 10 specimens, 1238 N for 25 specimens, 853 N for 50 specimens, and 595 N for 100 specimens tested (Appendix 2); if we consider that the estimated mean will be equal to 4546 N, the one obtained in the previous experiment, we can obtain the corresponding 95% confidence intervals (Fig. 2). Because we always deal with limited samples, we never exactly know the true mean or standard deviation of the parameter distribution; otherwise, we would not perform the experiment. We only approximate these values, and the results obtained can vary from the planned experiment. Nonetheless, what we identify at the time of planning is that testing more than 50 specimens, for instance 100, will multiply the costs and time necessary for the experiment while providing only a slight improvement in precision.
The graph shows the predicted confidence interval for experiments with an increasing number of specimens tested based on the study by Handl et al. [ 11 ] of 21 fresh-frozen cadavers with a mean ultimate load failure of four-strand hamstring tendon constructs of 4546 N and standard deviation of 1500 N.
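A small sketch (our own, using the t-based precision formula of Appendix 2) reproduces the widths quoted above from the Handl et al. [11] values:

```python
from scipy.stats import t

mean, sd = 4546.0, 1500.0                        # values reported by Handl et al. [11]
for n in (5, 10, 25, 50, 100):
    half = t.ppf(0.975, n - 1) * sd / n ** 0.5   # half-width of the 95% CI
    print(n, round(2 * half), round(mean - half), round(mean + half))
    # widths -> about 3725, 2146, 1238, 853, 595 N, matching the text to within rounding
```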
Similarly, sample size issues should be considered when performing regression analyses, namely, when trying to assess the effect of a particular covariate, or set of covariates, on an outcome. The effective power to detect the significance of a covariate in predicting this outcome depends on the outcome modeled [ 14 , 30 ]. For instance, when using a Cox regression model, the power of the test to detect the significance of a particular covariate does not depend on the size of the sample per se but on the number of specific critical events. In a cohort study of patients treated for soft tissue sarcoma with various treatments, such as surgery, radiotherapy, chemotherapy, etc, the power to detect the effect of chemotherapy on survival will depend on the number of patients who die, not on the total number of patients in the cohort. Therefore, when planning such studies, researchers should be familiar with these issues and decide, for example, to model a composite outcome, such as event-free survival that includes any of the following events: death from disease, death from other causes, recurrence, metastases, etc, to increase the power of the test.
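For illustration, a common approximation for such event-driven designs is Schoenfeld's formula, in which the required number of events depends on the targeted hazard ratio and the allocation ratio, not on the cohort size. This sketch is our addition, not part of the article:

```python
import math
from scipy.stats import norm

def events_needed(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    # Schoenfeld's approximation: total events required to detect the hazard ratio
    # with a two-sided test, whatever the total number of patients enrolled.
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z ** 2 / (allocation * (1 - allocation) * math.log(hazard_ratio) ** 2))

print(events_needed(0.7))   # -> 247 events needed, however large the cohort is
```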
Discussion

The reasons to plan a trial with an adequate sample size, likely to give enough power to detect a meaningful difference, are essentially ethical. Small trials are considered unethical by most, but not all, researchers because they expose participants to the burdens and risks of human research with a limited chance of providing any useful answers [2, 10, 28]. Underpowered trials also ineffectively consume resources (human, material) and add to the cost of healthcare to society. Although there are particular cases in which trials conducted on a small sample are justified, such as early-phase trials aimed at guiding the conduct of subsequent research (or formulating hypotheses) or, more rarely, trials in rare diseases with the aim of prospectively conducting meta-analyses, they generally should be avoided [10]. It is also unethical to conduct trials with too large a sample size because, in addition to the waste of time and resources, they expose participants in one group to receiving inadequate treatment after appropriate conclusions should have been reached. Interim analyses and adaptive trials have been developed in this context to shorten the time to decision and overcome these concerns [4, 16].
We raise two important points. First, we explained that, for practical and ethical reasons, experiments are conducted on a sample of limited size with the aim of generalizing the results to the population of interest, and that increasing the size of the sample is a way to combat uncertainty. When doing this, we implicitly consider the patients or specimens in the sample to be randomly selected from the population of interest, although this is almost never the case; even if it were, the population of interest would be limited in space and time. For instance, Marx et al. [20], in a survey conducted in late 1998 and early 1999, assessed the practices for anterior cruciate ligament reconstruction in a randomly selected sample of 725 members of the American Academy of Orthopaedic Surgeons; however, because only half the surgeons responded to the survey, their sample probably is not representative of all members of the society, who in turn are not representative of all orthopaedic surgeons in the United States, who again are not representative of all surgeons in the world because of the numerous differences among patients, doctors, and healthcare systems across countries. Similar surveys conducted in other countries have provided different results [17, 22]. Moreover, if the same survey were conducted today, the results would possibly differ. Therefore, another source of variation among studies, apart from sampling variability, is that samples may not be representative of the same population. When planning experiments, researchers must take care to make their sample representative of the population to which they want to infer, and readers, when interpreting the results of a study, should always assess first how representative the presented sample is with regard to their own patients. The process implemented to select the sample, the settings of the experiment, and the general characteristics and influencing factors of the patients must be described precisely to allow assessment of representativeness and possible selection biases [7].
Second, we have discussed sample size only for interpreting nonsignificant p values, but it also may be of interest when interpreting p values that are significant. Significant results issued from larger studies usually are given more credit than those from smaller studies because of the risk of exaggerated treatment effects being reported in studies with smaller samples or of lower quality [23, 27], and small trials are believed to be more biased than others. However, there is no statistical reason a significant result in a trial including 2000 patients should be given more belief than one in a trial including 20 patients, given that the significance level chosen is the same in both trials. Small but well-conducted trials may yield a reliable estimation of treatment effect. Kjaergard et al. [18], in a study of 14 meta-analyses involving 190 randomized trials, reported that small trials (fewer than 1000 patients) reported exaggerated treatment effects when compared with large trials. However, when considering only small trials with adequate randomization, blinding, and allocation concealment (the process that keeps clinicians and participants unaware of upcoming assignments; without it, even properly generated random allocation sequences can be subverted), this difference became negligible. Nonetheless, the advantages of a large sample size when interpreting significant results are that it allows a more precise estimate of the treatment effect and that it usually is easier to assess the representativeness of the sample and to generalize the results.
Sample size is important for planning and interpreting medical research and surgeons should become familiar with the basic elements required to assess sample size and the influence of sample size on the conclusions. Controlling for the size of the sample allows the researcher to walk a thin line that separates the uncertainty surrounding studies with too small a sample size from studies that have failed practical or ethical considerations because of too large a sample size.
Acknowledgments
We thank the editor, whose thorough readings of, and accurate comments on, drafts of the manuscript helped clarify it.
Appendix 1

The sample size (n) per group for comparing two means with a two-sided two-sample t test is

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d_t^2},$$

where $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the standard normal deviates for probabilities of 1 − α/2 and 1 − β, respectively, and $d_t = (\mu_0 - \mu_1)/\sigma$ is the targeted standardized difference between the two means.
The following values correspond to the example:

- α = 0.05 (statistical significance level), giving z_{1−α/2} = 1.96
- β = 0.10 (power of 90%), giving z_{1−β} = 1.28
- |μ_0 − μ_1| = 10 (difference in the mean score between the two groups)
- σ = 20 (standard deviation of the score in each group)
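Plugging these values into the formula above gives, as a worked check,

$$n = \frac{2\,(1.96 + 1.28)^2}{(10/20)^2} \approx 84;$$

with unrounded deviates ($z_{1-\beta} = 1.2816$) the result is 84.06, which is rounded up to the 85 patients per group used in the example.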
Two-sided tests which do not assume the direction of the difference (ie, that the mean value in one group would always be greater than that in the other) are generally preferred. The null hypothesis makes the assumption that there is no difference between the treatments compared, and a difference on one side or the other therefore is expected.
Appendix 2: Computation of Confidence Interval
To determine the estimation of a parameter, or alternatively the confidence interval, we use the distribution of the parameter estimate in repeated samples of the same size. For instance, consider a parameter with observed mean, m, and standard deviation, sd, in a given sample. If we assume the distribution of the parameter in the sample is close to a normal distribution, the means, $\bar{x}_n$, of several repeated samples of the same size have true mean, μ, the population mean, and estimated standard deviation

$$\frac{sd}{\sqrt{n}},$$

also known as the standard error of the mean, and

$$\frac{\bar{x}_n - \mu}{sd/\sqrt{n}}$$

follows a t distribution. For a large sample, the t distribution becomes close to the normal distribution; however, for a smaller sample size the difference is not negligible and the t distribution is preferred. The precision of the estimation (the full width of the confidence interval) is

$$2 \times t_{1-\alpha/2,\,n-1} \times \frac{sd}{\sqrt{n}},$$

and the confidence interval for μ is the range of values extending either side of the sample mean m by

$$t_{1-\alpha/2,\,n-1} \times \frac{sd}{\sqrt{n}}.$$
For example, Handl et al. [11], in a biomechanical study of 21 fresh-frozen cadavers, reported a mean ultimate load failure of four-strand hamstring tendon constructs of 4546 N under dynamic loading, with a standard deviation of 1500 N. If we were to plan an experiment, the anticipated precision of the estimation at the 95% level would be

$$2 \times t_{1-\alpha/2,\,n-1} \times \frac{1500}{\sqrt{n}},$$

for example, $2 \times 2.78 \times 1500/\sqrt{5} \approx 3725$ N for five specimens. The values 2.78, 2.26, 2.06, 2.01, and 1.98 correspond to the t distribution deviates for the probability of 1 − α/2, with 4, 9, 24, 49, and 99 (n − 1) degrees of freedom; the well-known corresponding standard normal deviate is 1.96. Given an estimated mean of 4546 N, the corresponding 95% confidence intervals are 2683 N to 6408 N for five specimens, 3473 N to 5619 N for 10 specimens, 3927 N to 5165 N for 25 specimens, 4120 N to 4972 N for 50 specimens, and 4248 N to 4844 N for 100 specimens (Fig. 2).
Similarly, for a proportion p in a given sample with sufficient sample size to assume a nearly normal distribution, the confidence interval extends either side of the proportion p by

$$z_{1-\alpha/2} \times \sqrt{\frac{p(1-p)}{n}}.$$
For a small sample size, an exact confidence interval for proportions should be used.
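As a sketch of both intervals (our own illustration; the helper name and counts are hypothetical), the normal approximation can be compared with the exact Clopper-Pearson interval:

```python
from scipy.stats import beta, norm

def proportion_ci(k, n, alpha=0.05):
    # Normal-approximation and exact (Clopper-Pearson) confidence intervals
    # for a proportion, with k successes out of n observations.
    p = k / n
    half = norm.ppf(1 - alpha / 2) * (p * (1 - p) / n) ** 0.5
    approx = (p - half, p + half)
    exact = (beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0,
             beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0)
    return approx, exact

print(proportion_ci(8, 20))   # for p = 0.40 in a small sample, the two intervals differ noticeably
```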
Each author certifies that he or she has no commercial associations (eg, consultancies, stock ownership, equity interest, patent/licensing arrangements, etc) that might pose a conflict of interest in connection with the submitted article.
- 1. Altman DG. Practical Statistics for Medical Research. London, UK: Chapman & Hall; 1991.
- 2. Bacchetti P, Wolf LE, Segal MR, McCulloch CE. Ethics and sample size. Am J Epidemiol. 2005;161:105–110.
- 3. Bailey CS, Fisher CG, Dvorak MF. Type II error in the spine surgical literature. Spine. 2004;29:1146–1149.
- 4. Bauer P, Brannath W. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today. 2004;9:351–357.
- 5. Bijur PE, Latimer CT, Gallagher EJ. Validation of a verbally administered numerical rating scale of acute pain for use in the emergency department. Acad Emerg Med. 2003;10:390–392.
- 6. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
- 7. Ellenberg JH. Selection bias in observational and experimental studies. Stat Med. 1994;13:557–567.
- 8. Freedman KB, Back S, Bernstein J. Sample size and statistical power of randomised, controlled trials in orthopaedics. J Bone Joint Surg Br. 2001;83:397–402.
- 9. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200–206.
- 10. Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–362.
- 11. Handl M, Drzik M, Cerulli G, Povysil C, Chlpik J, Varga F, Amler E, Trc T. Reconstruction of the anterior cruciate ligament: dynamic strain evaluation of the graft. Knee Surg Sports Traumatol Arthrosc. 2007;15:233–241.
- 12. Harris WH. Traumatic arthritis of the hip after dislocation and acetabular fractures: treatment by mold arthroplasty: an end-result study using a new method of result evaluation. J Bone Joint Surg Am. 1969;51:737–755.
- 13. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55:19–24.
- 14. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.
- 15. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407–415.
- 16. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.
- 17. Kapoor B, Clement DJ, Kirkley A, Maffulli N. Current practice in the management of anterior cruciate ligament injuries in the United Kingdom. Br J Sports Med. 2004;38:542–544.
- 18. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med. 2001;135:982–989.
- 19. Machin D, Campbell MJ. Statistical Tables for the Design of Clinical Trials. Oxford, UK: Blackwell Scientific Publications; 1987.
- 20. Marx RG, Jones EC, Angel M, Wickiewicz TL, Warren RF. Beliefs and attitudes of members of the American Academy of Orthopaedic Surgeons regarding the treatment of anterior cruciate ligament injury. Arthroscopy. 2003;19:762–770.
- 21. Michener LA, McClure PW, Sennett BJ. American Shoulder and Elbow Surgeons Standardized Shoulder Assessment Form, patient self-report section: reliability, validity, and responsiveness. J Shoulder Elbow Surg. 2002;11:587–594.
- 22. Mirza F, Mai DD, Kirkley A, Fowler PJ, Amendola A. Management of injuries to the anterior cruciate ligament: results of a survey of orthopaedic surgeons in Canada. Clin J Sport Med. 2000;10:85–88.
- 23. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352:609–613.
- 24. Neyman J, Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika. 1928;20A:175–240.
- 25. Neyman J, Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika. 1928;20A:263–294.
- 26. Rubinstein LV, Korn EL, Freidlin B, Hunsberger S, Ivy SP, Smith MA. Design issues of randomized phase II trials and a proposal for phase II screening trials. J Clin Oncol. 2005;23:7199–7206.
- 27. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408–412.
- 28. Stenning SP, Parmar MK. Designing randomised trials: both large and small trials are needed. Ann Oncol. 2002;13(suppl 4):131–138.
- 29. Sterne JA, Davey Smith G. Sifting the evidence: what's wrong with significance tests? BMJ. 2001;322:226–231.
- 30. Vaeth M, Skovlund E. A simple approach to power and sample size calculations in logistic regression and Cox regression models. Stat Med. 2004;23:1781–1792.
Small Sample Research: Considerations Beyond Statistical Power
Kathleen E. Etz and Judith A. Arroyo

Prevention Science 16, 1033–1036 (2015). Published 19 August 2015. doi:10.1007/s11121-015-0585-4
Small sample research presents a challenge to current standards of design and analytic approaches and the underlying notions of what constitutes good prevention science. Yet, small sample research is critically important as the research questions posed in small samples often represent serious health concerns in vulnerable and underrepresented populations. This commentary considers the Special Section on small sample research and also highlights additional challenges that arise in small sample research not considered in the Special Section, including generalizability, determining what constitutes knowledge, and ensuring that research designs match community desires. It also points to opportunities afforded by small sample research, such as a focus on and increased understanding of context and the emphasis it may place on alternatives to the randomized clinical trial. The commentary urges the development and adoption of innovative strategies to conduct research with small samples.
Small sample research presents a direct challenge to current standards of design and analytic approaches and the underlying notions of what constitutes good prevention science. While we can have confidence that our scientific methods have the ability to answer many research questions, we have been limited in our ability to take on research with small samples because we have not developed or adopted the means to support rigorous small sample research. This Special Section identifies some tools that can be used for small sample research. It reminds us that progress in this area will likely require expansion of our ideas of what constitutes rigor in analysis and design strategies that address the unique characteristics and accompanying challenges of small sample research. Advances will also require making room for the adoption of innovative design and statistical analysis approaches. The collection of papers makes a significant contribution to the literature and marks major development in the field.
Innovations in small sample research are particularly critical because the research questions posed in small samples often focus on serious health concerns in vulnerable populations. Individuals most at risk for or afflicted by health disparities (e.g., racial and ethnic minorities) are by definition small in number when compared to the larger, dominant society. The current state of the art in design and statistical analysis in prevention science, which is highly dependent on large samples, has severely handicapped investigation of health disparities in these smaller populations. Unless we develop research techniques suitable for small group design and expand our concepts of what design and analytic strategies provide sufficient scientific rigor, health disparities will continue to lay waste to populations that live in smaller communities or who are difficult to recruit in large numbers. Particularly when considering high-risk, low base rate behaviors such as recurrent binge drinking or chronic drug use, investigators are often limited by small populations in many health disparity groups and by small numbers of potential participants in towns, villages, and rural communities. Even in larger, urban settings, researchers may experience constraints on recruitment such as difficulty identifying a sufficiently large sample, distrust of research, lack of transportation or time outside of work hours, or language issues. Until now, small sample sizes and the lack of accepted tools for small sample research have decreased our ability to harness the power of science to research preventive solutions to health disparities. The collection of articles in this Special Section helps to address this by bringing together multiple strategies and demonstrating their strength in addressing research questions with small samples.
Small sample research issues also arise in multi-level, group-based, or community-level intervention research (Trickett et al. 2011 ). An example of this is a study that uses a media campaign and compares the efficacy of that campaign across communities. In such cases, the unit of analysis is the group, and the limited number of units that can be feasibly involved in a study makes multi-level intervention research inevitably an analysis of small samples. The increasingly recognized importance of intervening in communities at multiple levels (Frohlich and Potvin 2008 ) and the desire to understand the efficacy and effectiveness of multi-level interventions (Hawe 1994 ) increase the need to devise strategies for assessing interventions conducted with small samples.
The Special Section makes a major contribution to small sample research, identifying tools that can be used to address small sample design and analytic challenges. The articles here can be grouped into four areas: (1) identification of refinements in statistical applications and measurement that can facilitate analyses with small samples, (2) alternatives to randomized clinical trial (RCT) designs that maintain rigor while maximizing power, (3) use of qualitative and mixed methods, and (4) Bayesian analysis. The Special Section provides a range of alternative strategies to those that are currently employed with larger samples. The first and last papers in the Special Section (Fok et al. 2015 ; Henry et al. 2015a ) examine and elaborate on the contributions of these articles to the field. As this is considered elsewhere, we will focus our comments more on issues that are not already covered but that will be increasingly important as this field moves forward.
One challenge that is not addressed by the papers in this Special Section is the generalizability of small sample research findings, particularly when working with culturally distinct populations. Generalizability poses a different obstacle than those associated with design and analysis, in that it is not related to rigor or the confidence we can have in our conclusions. Rather, it limits our ability to assume the results will apply to populations other than those from whom a sample is drawn and, as such, can limit the application of the work. The need to discover prevention solutions for all people, even if they happen to be members of a small population, begs questions of the value of generalizability and of the importance ascribed to it. Further, existing research raises long-standing important questions about whether knowledge produced under highly controlled conditions can generalize to ethnoculturally diverse communities (Atkins et al. 2006 ; Beeker et al. 1998 ; Green and Glasgow 2006 ). Regardless, the inability to generalize beyond a small population can present a barrier to funding. When grant applications are reviewed, projects that are not seen as widely generalizable often receive poor ratings. Scientists conducting small sample research with culturally distinct groups are frequently stymied by how they can justify their research when it is not generalizable to large segments of the population. In some instances, the question that drives the research is that which limits generalizability. For example, research projects on cultural adaptations of established interventions are often highly specific. An adaptation that might be efficacious in one small sample might not be so in other contexts. This is particularly the case if the adaptation integrates local culture, such as preparing for winter and subsistence activities in Alaska or integrating the horse culture of the Great Plains. Even if local adaptation is not necessary, dissemination research to ascertain the efficacy and/or effectiveness of mainstream, evidence-based interventions when applied to diverse groups will be difficult to conduct if we cannot address concerns about generalizability.
It is not readily apparent how to address issues of generalizability, but it is clear that this will be challenging and will require creativity. One potential strategy is to go beyond questions of intervention efficacy to address additional research questions that have the potential to advance the field more generally. For example, Allen and colleagues’ ( 2014 ) scientific investigations extended beyond development of a prevention intervention in Alaska Native villages to identification and testing of the underlying prevention processes that were at the core of the culturally specific intervention. This isolation of the key components of the prevention process has the potential to inform and generalize across settings. The development of new statistical tools for small culturally distinct samples might also be helpful in other research contexts. Similarly, the identification of the most potent prevention processes for adaptation also might generalize. As small sample research evolves, we must remain open to how this work has the potential to be highly valuable despite recognizing that not all aspects of it will generalize and also take care to identify what can be applied generally.
While not exclusive to small sample research, additional difficulties that can arise in conducting research in some small, culturally distinct samples are the questions of what constitutes knowledge and how to include alternative forms of knowledge (e.g., indigenous ways of knowing, folk wisdom) in health research (Aikenhead and Ogawa 2007 ; Gone 2012 ). For many culturally distinct communities that turn to research to address their health challenges, the need for large samples and methods demanded by mainstream science might be incongruent with local epistemologies and cultural understandings of how the knowledge to inform prevention is generated and standards of evidence are established. Making sense of how or whether indigenous knowledge and western scientific approaches can work together is an immense challenge. The Henry, Dymnicki, Mohatt, Kelly, and Allen article in this Special Section recommends combining qualitative and quantitative methods as one way to address this conundrum. However, this strategy is not sufficient to address all of the challenges encountered by those who seek to integrate traditional knowledge into modern scientific inquiry. For culturally distinct groups who value forms of knowledge other than those generated by western science, the research team, including the community members, will need to work together to identify ways to best ensure that culturally valued knowledge is incorporated into the research endeavor. The scientific field will need to make room for approaches that stem from the integration of culturally valued knowledge.
Ensuring that the research design and methods correspond to community needs and desires can present an additional challenge. Investigations conducted with small, culturally distinct groups often use community-based participatory research (CBPR) approaches (Minkler and Wallerstein 2008 ). True CBPR mandates that community partners be equal participants in every phase of the research, including study design. From an academic researcher’s perspective, the primary obstacle for small sample research may be insufficient statistical power to conduct a classic RCT. However, for the small group partner, the primary obstacle may be the RCT design itself. Many communities will not allow a RCT because assignment of some community members to a no-treatment control condition can violate culturally based ethical principles that demand that all participants be treated equally. Particularly in communities experiencing severe health disparities, community members may want every person to receive the active intervention. While the RCT has become the gold standard because it is believed to be the most rigorous test of intervention efficacy, it is clear the RCT does not serve the needs of all communities.
While presenting challenges for current methods, it is important to note that small sample research can also expand our horizons. For example, attempts to truly comprehend culturally distinct groups will lead to a better understanding of the role of context in health outcomes. Current approaches more often attempt to control for extraneous variables rather than work to more accurately model potentially rich contextual variables. This blinds us to cultural differences between and among small groups that might contribute to outcomes and improve health. Analytical strategies that mask these nuances will fail to detect information about risk and resilience factors that could impact intervention. Multi-level intervention research (which we pointed out earlier qualifies as small sample research) that focuses on contextual changes as well as or instead of change in the individual will also inform our understanding of context, elucidating how to effectively intervene to change context to promote health outcomes. Thus, considering how prevailing methods limit our work in small samples can also expose ways that alternative methods may advance our science more broadly by enhancing both our understanding of context and how to intervene in context.
Small sample science requires us to consider alternatives to the RCT, and this consideration introduces additional opportunities. The last paper in this Special Section (Henry et al. 2015b) notes compelling critiques of the RCT. Small sample research demands we incorporate alternative strategies that, in contrast to the classic RCT, may use the available information more efficiently in some instances and may be better aligned with community desires. Alternative designs for small sample research may offer means to enhance and ensure scientific rigor without depending on the RCT design (Srinivasan et al. 2015). It is important to consider what alternative approaches can contribute rather than adhering rigidly to the RCT.
New challenges require innovative solutions. Innovation is the foundation of scientific advances. It is one of only five National Institutes of Health grant review criteria. Despite the value to science of innovation, research grant application reviewers are often skeptical of new strategies and are reluctant to support risk taking in science. As a field, we seem accustomed to the use of certain methods and statistics, generally accepting and rarely questioning if they are the best approach. Yet, it is clear that common methods that work well with large samples are not always appropriate for small samples. Progress will demand that new approaches be well justified and also that the field supports innovation and the testing of alternative approaches. Srinivasan and colleagues ( 2015 ) further recommend that it might be necessary to offer training to grant application peer reviewers on innovative small sample research methods, thus ensuring that they are knowledgeable in this area and score grant applications appropriately. Alternative approaches need to be accepted into the repertoire of available design and assessment tools. The articles in this Special Section all highlight such innovation for small sample research.
It would be a failure of science and the imagination if newly discovered or re-discovered (i.e., Bayesian) strategies are not employed to facilitate rigorous assessment of interventions in small samples. It is imperative that the tools of science do not limit our ability to address pressing public health questions. New approaches can be used to address contemporary research questions, including providing solutions to the undue burden of disease that can and often does occur in small populations. It must be the pressing nature of the questions, not the limitations of our methods, that determines what science is undertaken (see also Srinivasan et al. 2015 ). While small sample research presents a challenge for prevailing scientific approaches, the papers in this Special Section identify ways to move this science forward with rigor. It is imperative that the field accommodates these advances, and continues to be innovative in response to the challenge of small sample research, to ensure that science can provide answers for those most in need.
Aikenhead, G. S., & Ogawa, M. (2007). Indigenous knowledge and science revisited. Cultural Studies of Science Education, 2, 539–620.

Allen, J., Mohatt, G. V., Fok, C. C. T., Henry, D., Burkett, R., & People Awakening Project. (2014). A protective factors model for alcohol abuse and suicide prevention among Alaska Native youth. American Journal of Community Psychology, 54, 125–139.

Atkins, M. S., Frazier, S. L., & Cappella, E. (2006). Hybrid research models: Natural opportunities for examining mental health in context. Clinical Psychology Review, 13, 105–108.

Beeker, C., Guenther-Grey, C., & Raj, A. (1998). Community empowerment paradigm drift and the primary prevention of HIV/AIDS. Social Science & Medicine, 46, 831–842.

Fok, C. C. T., Henry, D., & Allen, J. (2015). Maybe small is too small a term: Introduction to advancing small sample prevention science. Prevention Science.

Frohlich, K. L., & Potvin, L. (2008). Transcending the known in public health practice: The inequality paradox: The population approach and vulnerable populations. American Journal of Public Health, 98, 216–221.

Gone, J. P. (2012). Indigenous traditional knowledge and substance abuse treatment outcomes: The problem of efficacy evaluation. American Journal of Drug and Alcohol Abuse, 38, 493–497.

Green, L. W., & Glasgow, R. E. (2006). Evaluating the relevance, generalization, and applicability of research: Issues in external validation and translation methodology. Evaluation & the Health Professions, 29, 126–153.

Hawe, P. (1994). Capturing the meaning of "community" in community intervention evaluation: Some contributions from community psychology. Health Promotion International, 9, 199–210.

Henry, D., Dymnicki, A. B., Mohatt, N., Kelly, J. G., & Allen, J. (2015a). Clustering methods with qualitative data: A mixed methods approach for prevention research with small samples. Prevention Science. doi:10.1007/s11121-015-0561-z.

Henry, D., Fok, C. C. T., & Allen, J. (2015b). Why small is too small a term: Prevention science for health disparities, culturally distinct groups, and community-level intervention. Prevention Science.

Minkler, M., & Wallerstein, N. (Eds.). (2008). Community-based participatory research for health: From process to outcomes (2nd ed.). San Francisco: Jossey-Bass.

Srinivasan, S., Moser, R. P., Willis, G., Riley, W., Alexander, M., Berrigan, D., & Kobrin, S. (2015). Small is essential: Importance of subpopulation research in cancer control. American Journal of Public Health, 105, 371–373.

Trickett, E. J., Beehler, S., Deutsch, C., Green, L. W., Hawe, P., McLeroy, K., Miller, R. L., Rapkin, B. D., Schensul, J. J., Schulz, A. J., & Trimble, J. E. (2011). Advancing the science of community-level interventions. American Journal of Public Health, 101, 1410–1419.
Compliance with Ethical Standards
No external funding supported this work.
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Because this article is a commentary, informed consent is not applicable.
Author information

Kathleen E. Etz, National Institute on Drug Abuse, National Institutes of Health, 6001 Executive Blvd., Bethesda, MD 20852, USA

Judith A. Arroyo, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, 5635 Fishers Lane, Bethesda, MD 20852, USA

Correspondence to Kathleen E. Etz.

Additional information
The opinions and conclusions here represent those of the authors and do not represent the National Institutes of Health, the National Institute on Drug Abuse, the National Institute on Alcohol Abuse and Alcoholism, or the US Government.
The importance of small samples in medical research
Affiliation: Clinical Research Department, Max Healthcare, New Delhi, India.

PMID: 34845889; PMCID: PMC8706541; DOI: 10.4103/jpgm.JPGM_230_21
Almost all bio-statisticians and medical researchers believe that a large sample is always helpful in providing more reliable results. Whereas this is true for some specific cases, a large sample may not be helpful in more situations than we contemplate because of the higher possibility of errors and reduced validity. Many medical breakthroughs have occurred with self-experimentation and single experiments. Studies, particularly analytical studies, may provide more truthful results with a small sample because intensive efforts can be made to control all the confounders, wherever they operate, and sophisticated equipment can be used to obtain more accurate data. A large sample may be required only for the studies with highly variable outcomes, where an estimate of the effect size with high precision is required, or when the effect size to be detected is small. This communication underscores the importance of small samples in reaching a valid conclusion in certain situations and describes the situations where a large sample is not only unnecessary but may even compromise the validity by not being able to exercise full care in the assessments. What sample size is small depends on the context.
Keywords: Medical research; n = 1 self-experiments; small sample.
Sample size and its evolution in research
*Corresponding author: Anthony Vipin Das, Department of eyeSmart EMR and AEye L V Prasad Eye Institute, Hyderabad, Telangana, [email protected]
How to cite this article: Gumpili SP, Das AV. Sample size and its evolution in research. IHOPE J Ophthalmol 2022;1:9-13.
Sample size is one of the crucial and basic steps involved in planning any study. This article aims to trace the evolution of sample size across the years, from hundreds to thousands to millions to billions, and to a trillion in the near future (H-K-M-B-T). It also aims to understand the importance of sampling in the era of big data.
Study Design, Primary Outcome Measure, Methods, Results, and Interpretation
A sample size which is too small will not be a true representation of the population, whereas a large sample size will involve putting more individuals at risk. An optimum sample size needs to be employed to identify statistically significant differences, if they exist, and to obtain scientifically valid results.
The design of the study, the primary outcome, the sampling method used, the dropout rate, the effect size, the power, the level of significance, and the standard deviation are some of the many factors which affect the sample size, and all of them need to be taken into account while calculating it. Many sources are available for calculating sample size; discretion needs to be used while choosing the right one.

The large volumes of data, and the corresponding number of data points being analyzed, are redefining many industries, including healthcare. The larger the sample size, the more insightful the information, the better the identification of rare side effects, the smaller the margin of error, the higher the confidence level, and the more accurate the models. Advances in the digital era have ensured that we do not face most of the obstacles traditionally associated with statistical sampling, yet they bring their own set of challenges. Hence, considerable effort and time should be invested in selecting appropriate sampling techniques and in reducing sampling bias and errors. This will ensure the reliability and reproducibility of the results obtained. Along with a large sample size, the focus should be on getting to know the data better: the sampling frame and the context in which the data were collected. We need to focus on the creation of good-quality data and structured systems to capture the sample. Good data quality management makes sure that the data are structured appropriately.
Keywords: Sample size; power of study; effect size; study design.
- SAMPLE SIZE AND ITS IMPORTANCE IN RESEARCH
In statistics, the term “population” is defined as the entire group of events or items that is of interest to our research question. Since it is not feasible to study the entire population, a subset of the population is chosen to adequately represent it; this subset is defined as the sample. Every individual in the chosen population should have an equal chance of being selected. Sample size, typically denoted by n, signifies the total number of observations or participants included in a study.
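As a concrete illustration, a simple random sample, in which every individual has an equal chance of selection, can be drawn in a few lines of Python; the population below is a hypothetical list of patient identifiers, and the sample size of 50 is arbitrary.

```python
import random

# Hypothetical example: a population of 10,000 patient identifiers.
population = list(range(10_000))
n = 50  # arbitrary sample size for illustration

# random.sample draws without replacement, giving every individual
# an equal chance of being selected (simple random sampling).
sample = random.sample(population, n)
print(f"Sampled {len(sample)} of {len(population)} individuals")
```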
One of the key steps in a clinical study is the calculation of the sample size. If the sample size is very small, the sample is not a true representation of our population and the results obtained cannot be extrapolated to the entire population. In addition, differences between the groups may go undetected when the sample size is too small; as the saying goes, “absence of evidence is not evidence of absence.” If our sample size is larger than required, more individuals are put at risk of the intervention under study, which is highly unethical. Moreover, small differences may reach statistical significance without being clinically meaningful, which can be misleading and may lead to grave consequences such as failure to make the right decision about treatments. [1] An oversized study is also a huge waste of finances, human resources, and time. Further, saturation is defined as the point after which collecting more data will no longer yield new results. [2] Saturation depends on various factors such as the homogeneity or heterogeneity of the population being studied, the selection criteria used, the financial resources available, and the timelines set. All these factors need to be carefully considered before the start of any study.
The central limit theorem states that, irrespective of the distribution of the population, as the sample size increases the distribution of the sample mean approximates a normal distribution (“bell curve”). [3] Therefore, as the sample size increases, the mean and standard deviation of the sample get closer in value to the mean (μ) and standard deviation (σ) of the population.
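The theorem is easy to verify empirically. The following sketch (a minimal simulation in Python, with an arbitrary skewed population and illustrative sample sizes) draws repeated samples from an exponential population and shows the distribution of sample means tightening around the population mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(1) population: heavily skewed, population mean = 1.
for n in (5, 30, 200):
    # 10,000 repeated samples of size n; take the mean of each.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # The spread of the sample means shrinks roughly as sigma / sqrt(n).
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"SD of sample means={means.std():.3f}  sigma/sqrt(n)={1 / np.sqrt(n):.3f}")
```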
An optimum sample size is the minimum number of individuals needed to identify a statistically significant difference, if one truly exists, and the means by which we obtain scientifically valid results. A fine balance needs to be maintained, and an optimum sample size arrived at. The sample size therefore lies at the heart of any study.
- FACTORS AFFECTING SAMPLE SIZE CALCULATION
Sample size affects the precision of our estimates and the power of the study to arrive at any conclusion. The sample size for any study depends on the design of the study, the primary outcome being studied (continuous or binary), whether the test is one-tailed or two-tailed, the sampling method used, the dropout rate, and the measures of outcome such as effect size, power, level of significance, and standard deviation. [4-7] Descriptive studies such as surveys, case series, case reports, and questionnaires require a larger sample size than analytical studies. The sample size in qualitative research is often smaller than that used in quantitative research. [8] Observational studies need larger samples than experimental studies. [9]
As the effect size to be detected decreases, the required sample size increases, and vice versa. If the population being studied is more homogeneous, the standard deviation is smaller, and hence a smaller sample size suffices; a more heterogeneous population entails a larger sample size to obtain accurate results.
Before the start of the study, we set an acceptable level of significance (the threshold for the P value). P = 0.05 indicates that we accept a 5% probability that the observed results are due to chance and not to the intervention (a false-positive result, or Type I error). In other words, 5 out of 100 times we conclude that there is a difference when in fact there is none. As the level of significance decreases, the sample size increases. In the exactly converse situation, there is a chance of failing to detect a difference even when it is actually present (a false-negative result, or Type II error). The probability of committing a Type II error is called beta (β), and (1 − β) is called power, defined as the probability of detecting a difference when it truly exists. As the desired power increases, the sample size also increases. The two most applicable types of power analysis are a priori and post hoc. As the name suggests, an a priori analysis is performed before the experiment is conducted, as part of the research design process. [10] A post hoc analysis is performed after the study has been conducted.
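For illustration, an a priori power analysis for a two-group comparison can be run with the statsmodels package in Python. The sketch below assumes a two-sample t-test design; the effect size (Cohen's d of 0.5), significance level, and power are illustrative values, not recommendations.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the sample size per group, given effect size, alpha, and power.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64

# Lowering alpha or raising the desired power increases the required n,
# as described above.
for alpha, power in [(0.01, 0.80), (0.05, 0.90)]:
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=power)
    print(f"alpha={alpha}, power={power}: n per group = {n:.0f}")
```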
During the calculation of sample size, we need to accept some risk of a false-negative or a false-positive result; if we did not, we would need an infinitely large sample. It has been observed that for most trials with negative results the sample size was not large enough; hence, the reporting of statistical power and sample size needs to improve. [11,12] The calculation of sample size can be guided by pilot studies, previous literature, and past clinical experience. It requires careful judgment and a compromise between strict criteria and practicality of access. [Table 1] summarizes the effect of these factors on the required sample size.
- SOFTWARE FOR ESTIMATING SAMPLE SIZE
Sample size calculating software has made it easier and simpler to estimate sample size. The appropriate software varies with the type of study design. Existing statistical packages such as the Statistical Package for the Social Sciences (SPSS), SAS, Stata, and R have methods for determining sample size incorporated into them. Dedicated software such as PASS, G*Power, Power and Precision, Russ Lenth's power applets, Minitab, and SampSize is also available for calculating sample size. [9,13]
Much of the software used for estimating sample size has limited validity because it typically relies on a single formula. Any error can mislead the researcher and distort the results of the study, so it is essential that such errors are controlled. A review by Abbassi et al., which studied the accuracy of online sample size calculators, showed that most sites merely calculated the sample size for estimating proportions and treated 50% as a fixed value in the formula; the results were not accurate for the examples considered. [14] Discretion needs to be exercised when using online calculators, and the researcher should be well aware of the research design, the outcome, common errors, and the method and parameters being used to estimate the sample size.
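One way to exercise that discretion is to cross-check a calculator's output against the classical normal-approximation formula for comparing two means, n = 2σ^2 (z_{1−α/2} + z_{1−β})^2 / Δ^2 per group. The Python sketch below implements that formula; the mean difference of 10 and standard deviation of 20 are illustrative values only.

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means:
    n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 when alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 when power = 0.80
    return 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2

# Illustrative values: detect a mean difference of 10 with SD 20.
print(round(n_per_group(delta=10, sigma=20)))  # about 63 per group
```

The exact t-test calculation gives a slightly larger answer, so small discrepancies between tools are expected; a large discrepancy signals that a different formula or parameterization is being used.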
- HYPOTHESIS RESEARCH AND NON-HYPOTHESIS RESEARCH AND ITS RELATION WITH SAMPLE SIZE
Most scientific research is driven by a hypothesis, an educated guess based on observations and prior knowledge. The null and alternative hypotheses are two mutually exclusive statements about a population. The null hypothesis (H0) states the opposite of what the researcher expects, and the alternative hypothesis (Ha) states the result the researcher expects. Hypothesis testing uses data to determine whether or not to reject the null hypothesis. Inability to reject the null hypothesis may simply mean that the evidence required to reject it is insufficient.
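A minimal worked example of this logic, using simulated data and a two-sample t-test in Python (the group means, spread, and sizes are arbitrary):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Simulated continuous outcome in two groups of 40 participants each.
control = rng.normal(loc=50, scale=10, size=40)
treated = rng.normal(loc=56, scale=10, size=40)

# H0: the group means are equal; Ha: they differ (two-sided test).
t_stat, p_value = ttest_ind(treated, control)
if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject H0")
else:
    print(f"p = {p_value:.3f}: fail to reject H0 (not proof that H0 is true)")
```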
Conventionally, research in many sectors was pursued through hypothesis-driven investigation. Lately, research that is not driven by a hypothesis has been gaining momentum; examples include model and database development, high-throughput genomics, engineering, and biology. [15] This way of working allows the data to lead the way, so that we can embark on bolder journeys unconstrained by our existing knowledge. High-performance computing and large sample sizes are important catalysts that will fuel hypothesis-free research and open new avenues.
- EVOLUTION OF SAMPLE SIZE IN DIFFERENT SECTORS
Data volumes are exploding: more data have been created in the past two years than in the entire previous history of the human race. Big data now encompasses all sectors, from healthcare and governance to finance, psychiatry, remote sensing, manufacturing, and education.
Companies across various sectors of industry are leveraging big data to ensure data driven decision making. The large volumes of data and the corresponding number of data points being analyzed are redefining many industries.
A review by Button et al. showed how small sample sizes undermine reliability in the field of neuroscience. Their results indicated that the median statistical power in neuroscience is only 21%. When an underpowered study discovers a true effect, the estimate of the effect size it provides is likely to be inflated; this is called the winner's curse. Hence, if a study with a small sample size is the only source of evidence, it is difficult to have confidence in that evidence. In spite of scientists pursuing ever smaller effects, the average sample size in neuroscience has not changed over time: advances in analysis techniques have not been matched by corresponding improvements in study design in the field. [16,17] Unreliable research is wasteful and inefficient, so there is an ethical dimension to low power.
On the contrary, in the field of empirical finance, the vast majority of studies use large sample sizes together with conventional thresholds for statistical significance, which may lead to large-sample bias. [18] Suitable thresholds for statistical significance therefore have to be chosen for a given sample size.
Estimates suggest that by better integrating big data, healthcare could save as much as $300 billion a year. The healthcare industry is rapidly following suit, given the advent of electronic medical record systems which capture data in a structured format. [19-21] This is a highly welcome change and will ensure reliability and reproducibility in the field of healthcare.
- IMPORTANCE OF SAMPLING IN THE ERA OF BIG DATA
Big data refers to datasets that are too large or complex for traditional data processing applications. At present, data are being generated rapidly in all fields on a day-to-day basis. Thanks to this advancement in the current digital era and the falling costs of data collection and processing, we have overcome some of the obstacles traditionally faced in statistical sampling.
Large datasets have their fair share of advantages. Given their volume, the data can be used to study rare events, and a large sample size keeps any outliers in the sample from driving statistically misguided decisions. The margin of error is a measure of how far our results may differ from the actual value in the population. The relationship between margin of error and sample size is inverse: the larger the sample size, the smaller the margin of error. A lower margin of error also signifies a higher confidence level in the results obtained. However, increasing the sample size beyond a certain point yields diminishing returns, as the gain in accuracy becomes negligible. [22] Bringing the margin of error below a certain threshold is rarely beneficial.
In fact, it would be ideal to spend the existing resources on reducing the sources of bias. [Figure 1] illustrates the relationship between sample size and margin of error.
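For a proportion, the 95% margin of error is roughly z * sqrt(p(1 − p)/n), which makes the diminishing return concrete: quadrupling the sample size only halves the margin. A short Python sketch with illustrative sample sizes:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for estimating a proportion: z * sqrt(p(1-p)/n).
    p = 0.5 is the worst case (largest margin)."""
    return z * math.sqrt(p * (1 - p) / n)

# Each step quadruples n but only halves the margin of error.
for n in (100, 400, 1_600, 6_400, 25_600):
    print(f"n={n:6d}  margin of error = +/-{margin_of_error(n) * 100:.2f} percentage points")
```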
- LARGE SAMPLE SIZE AND POTENTIAL FOR BIAS: HOW TO TREAD WITH CAUTION
While advances in the digital era have removed most of the obstacles traditionally faced in statistical sampling, they bring their own share of challenges. Even when the sample size is large, our data might still represent only a part of the population rather than the whole. Sampling bias is one of the major factors affecting the performance of a model and the results obtained. [23] It should be ensured that the training and test datasets used to build and evaluate a model mirror the distribution of the entire dataset.
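One common safeguard is a stratified split, sketched below with scikit-learn on synthetic placeholder data, so that the training and test sets mirror the outcome distribution of the full dataset.

```python
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with an imbalanced outcome (80% vs 20%).
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y preserves the 80/20 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print("Positive rate, train:", sum(y_train) / len(y_train))  # 0.20
print("Positive rate, test: ", sum(y_test) / len(y_test))    # 0.20
```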
“Big data hubris” is the notion that big data analytics can be used as a substitute for, rather than a supplement to, traditional analytics. Google Flu Trends and the Literary Digest poll for the 1936 United States presidential election are classic examples. Before the 1936 election, the Literary Digest poll had always correctly predicted the winner. In 1936, the poll concluded that the Republican candidate, Governor Landon, was likely to win by a majority against the incumbent President Franklin D. Roosevelt; on the day of the results, Roosevelt won the election by a landslide. The magazine had polled a sample of over 2 million people based on car and telephone registrations. However, there was a problem with the sample frame: this was the time of the Depression, and not everyone could afford a car or a telephone. In spite of the large sample size, the sample frame was flawed. In 2008, Google launched Google Flu Trends (GFT) to predict the spread of influenza across the US. GFT consistently overestimated flu-related visits and was highly inaccurate during the peak flu season, when it could have been most useful. This reiterates the fact that an incorrect sample frame can destroy a study irrespective of the sample size. Hence, considerable effort and time should be invested in selecting appropriate sampling techniques rather than simply amassing data on everyone who is accessible.
When the sample size is small, it is easier to check and control the quality of the data and to spot any errors. With a larger sample size, greater effort and time must be spent checking the accuracy and quality of the data and identifying outliers or missing values before any further analysis is performed.
Studies with a large sample size can identify effects that are statistically significant but inconsequential. [23] If we compare two trials, one with a smaller and one with a larger sample size, and assume a significance level of 0.05 in both cases, the effect size needed to reach that significance level in the smaller study is considerably larger than the one needed in the larger study. So even though studies with larger sample sizes have many advantages, we should remain aware that a statistically significant treatment effect can be quite modest. A larger sample size will never compensate for the other challenges faced in analytics; hence, apart from increasing the sample size, our main focus should also be on reducing sampling bias and other errors. [23]
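This trade-off can be made explicit by solving the power relationship for the minimum detectable effect at a fixed significance level and power. The statsmodels sketch below (illustrative sample sizes, two-sample t-test assumed) shows the detectable effect shrinking as n grows:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# At alpha = 0.05 and 80% power, the smallest standardized effect (Cohen's d)
# a two-sample t-test can detect shrinks as the per-group sample size grows.
for n in (20, 100, 1_000, 10_000):
    d = analysis.solve_power(nobs1=n, alpha=0.05, power=0.80)
    print(f"n = {n:6d} per group -> minimum detectable effect d = {d:.3f}")
```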
As with any type of data, the data captured in electronic medical records are only as good as the information entered into the systems. Even when the sample size is large, the system capturing the data needs to be robust and should ensure that the data are captured in a uniform format with structured forms and databases. In the future, data on behavior, environment, and other important aspects should also be captured to expand the range of variables whose effect on outcomes can be studied.
A larger sample size implies more insightful information, a smaller margin of error, a higher confidence level, and more accurate models, provided the data are used appropriately. The identification of rare side effects of medication, and of outcomes in people with rare diseases, also benefits greatly from a larger sample size.
We need to focus on creation of good quality data and structured systems to capture the sample. Good data quality management makes sure that the data are structured appropriately. Maintaining high levels of data quality enables organizations to reduce the cost of identifying and fixing bad data in their systems. It also helps prioritize and ensure the best use of resources.
Along with having a large sample size, the focus should be on getting to know the data better, the sample frame and the context in which it was collected. Exploratory data analysis as an initial step will help in unearthing all that the data have to reveal and also identify the outliers and missing values.
We have witnessed the evolution of sample size from hundreds to thousands to millions, and it will continue to evolve to billions, a trillion, and beyond (H-K-M-B-T) with the rapid growth of data and the exponential growth of technology. We are hopeful that the generation of new knowledge and data will open up new frontiers of research, development, and growth.
Sources of literature review include peer-reviewed articles, books, and conference papers.
Author contributions
Concept, A.V.D.; Design, A.V.D., G.S.P.; Literature search, G.S.P.; Manuscript preparation, G.S.P., A.V.D.; Manuscript editing, A.V.D., G.S.P.; Manuscript review, A.V.D., G.S.P.
Declaration of patient consent
Patient’s consent not required as there are no patients in this study.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
- Faber J, Fonseca LM. How sample size influences research outcomes. Dental Press J Orthod. 2014;19:27-9.
- Faulkner SL, Trotter SP. Data saturation. In: The International Encyclopedia of Communication Research Methods. New Jersey, United States: Wiley; 2017. p. 1-2.
- Trotter HF. An elementary proof of the central limit theorem. Arch Math. 1959;10:226-34.
- Charan J, Biswas T. How to calculate sample size for different study designs in medical research? Indian J Psychol Med. 2013;35:121-6.
- Noordzij M, Dekker FW, Zoccali C, Jager KJ. Sample size calculations. Nephron Clin Pract. 2011;118:c319-23.
- Kirby A, Gebski V, Keech AC. Determining the sample size in a clinical trial. Med J Aust. 2002;177:256-7.
- Phillips A, Campbell M. Using aspects of study design in sample size estimation. J Biopharm Stat. 1997;7:215-26.
- Dworkin SL. Sample size policy for qualitative studies using in-depth interviews. Arch Sex Behav. 2012;41:1319-20.
- Chander N. Sample size estimation. J Indian Prosthodont Soc. 2017;17:217-8.
- Farrokhyar F, Reddy D, Poolman RW, Bhandari M. Why perform a priori sample size calculation? Can J Surg. 2013;56:207-13.
- Abdulatif M, Mukhtar A, Obayah G. Pitfalls in reporting sample size calculation in randomized controlled trials published in leading anaesthesia journals: A systematic review. Br J Anaesth. 2015;115:699-707.
- Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: Basic principles and common pitfalls. Nephrol Dial Transplant. 2010;25:1388-93.
- Dattalo P. A review of software for sample size determination. Eval Health Prof. 2009;32:229-48.
- Abbassi M, Emamzadeh-Fard S, Yoosefi-Khanghah S, Mohammadi-Vajari MA, Taee F, Meysamie A. Sample size calculation on web, can we rely on the results? J Med Stat Inform. 2014;2:3.
- Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics. 1998;14:55-67.
- Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, et al. Power failure: Why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14:365-76.
- Szucs D, Ioannidis JP. Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990-2012) and of latest practices (2017-2018) in high-impact journals. Neuroimage. 2020;221:117164.
- Michaelides M. Large sample size bias in empirical finance. Finance Res Lett. 2021;41:S1544612320316494.
- Das AV, Kammari P, Vadapalli R, Basu S. Big data and the eyeSmart electronic medical record system-an 8-year experience from a three-tier eye care network in India. Indian J Ophthalmol. 2020;68:427-32.
- Donthineni PR, Kammari P, Shanbhag SS, Singh V, Das AV, Basu S. Incidence, demographics, types and risk factors of dry eye disease in India: Electronic medical records driven big data analytics report I. Ocul Surf. 2019;17:250-6.
- Das AV, Podila S, Prashanthi GS, Basu S. Clinical profile of pterygium in patients seeking eye care in India: Electronic medical records-driven big data analytics report III. Int Ophthalmol. 2020;40:1553-63.
- Horowitz I. The diminishing returns to sample information in the beta-binomial process. Z Nationalökonomie. 1972;32:493-500.
- Kaplan RM, Chambers DA, Glasgow RE. Big data and large sample size: A cautionary note on the potential for bias. Clin Transl Sci. 2014;7:342-6.