Replicates and repeats in designed experiments

This topic explains what a replicate is, how replicates differ from repeats, and gives examples of each.

Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. You can replicate combinations of factor levels, groups of factor level combinations, or entire designs.

For example, if you have three factors with two levels each and you test all combinations of factor levels (full factorial design), one replicate of the entire design would have 8 runs (2³). You can choose to do the design one time or have multiple replicates.
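As a rough illustration of how a replicated full factorial is laid out (this is not Minitab output; the factor names and level codes below are invented), a short Python sketch can enumerate the runs:

```python
from itertools import product

# Hypothetical factors, each at two levels coded -1 / +1
factors = {"Temperature": [-1, 1], "Pressure": [-1, 1], "Speed": [-1, 1]}
n_replicates = 2  # each full replicate of the design adds another 8 runs

runs = []
for rep in range(1, n_replicates + 1):
    for levels in product(*factors.values()):   # 2 x 2 x 2 = 8 combinations
        runs.append({"Replicate": rep, **dict(zip(factors.keys(), levels))})

print(len(runs))   # 16 runs: 8 factor-level combinations x 2 replicates
for run in runs[:4]:
    print(run)
# In practice, the run order would also be randomized before execution.
```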

  • Screening designs to reduce a large set of factors usually don't use multiple replicates.
  • If you are trying to create a prediction model, multiple replicates can increase the precision of your model.
  • If you have more data, you might be able to detect smaller effects or have greater power to detect an effect of fixed size (see the power sketch after this list).
  • Your resources can dictate the number of replicates you can run. For example, if your experiment is extremely costly, you might be able to run it only one time.
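To make the power point above concrete, here is a minimal sketch, assuming a simple two-group comparison, a normal approximation, and invented effect and noise values, of how power grows with the number of replicates per factor setting:

```python
from math import sqrt
from scipy.stats import norm

def approx_power(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test."""
    se = sigma * sqrt(2.0 / n_per_group)             # standard error of the difference
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - abs(effect) / se)   # far tail is negligible and ignored

# Hypothetical: true effect of 1.5 units, run-to-run standard deviation of 2 units
for n in (2, 3, 5, 8, 12):
    print(f"{n:2d} replicates per setting -> power ~ {approx_power(1.5, 2.0, n):.2f}")
```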

Repeat and replicate measurements are both multiple response measurements taken at the same combination of factor settings; but repeat measurements are taken during the same experimental run or consecutive runs, while replicate measurements are taken during identical but different experimental runs, which are often randomized.

It is important to understand the differences between repeat and replicate response measurements. These differences affect the structure of the worksheet and the columns in which you enter the response data, which in turn affects how Minitab interprets the data. You enter repeats across rows of multiple columns, while you enter replicates down a single column.
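The layout difference might look like the following sketch; the column names and numbers are invented for illustration and are not Minitab's required format. Repeats sit across several response columns within one row per run, while replicates add rows and keep the response in a single column:

```python
import pandas as pd

# Repeats: one row per experimental run, repeat measurements across columns
repeats = pd.DataFrame({
    "Temp":  [150, 150, 170, 170],
    "Speed": [10, 20, 10, 20],
    "Y1":    [4.1, 5.0, 6.2, 7.1],   # first measurement within the run
    "Y2":    [4.3, 4.8, 6.0, 7.4],   # second measurement within the same run
})

# Replicates: each replicated run gets its own row; the response goes down one column
replicates = pd.DataFrame({
    "Temp":      [150, 150, 170, 170, 150, 150, 170, 170],
    "Speed":     [10, 20, 10, 20, 10, 20, 10, 20],
    "Replicate": [1, 1, 1, 1, 2, 2, 2, 2],
    "Y":         [4.1, 5.0, 6.2, 7.1, 4.3, 4.8, 6.0, 7.4],
})

print(repeats)
print(replicates)
```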

Whether you use repeats or replicates depends on the sources of variability you want to explore and your resource constraints. Because replicates are from different experimental runs, usually spread along a longer period of time, they can include sources of variability that are not included in repeat measurements. For example, replicates can include variability from changing equipment settings between runs or variability from other environmental factors that may change over time. Replicate measurements can be more expensive and time-consuming to collect. You can create a design with both repeats and replicates, which enables you to examine multiple sources of variability.

A manufacturing company has a production line with a number of settings that can be modified by operators. Quality engineers design two experiments, one with repeats and one with replicates, to evaluate the effect of the settings on quality.

  • The first experiment uses repeats. The operators set the factors at predetermined levels, run production, and measure the quality of five products. They reset the equipment to new levels, run production, and measure the quality of five products. They continue until production is run one time at each combination of factor settings and five quality measurements are taken at each run.
  • The second experiment uses replicates. The operators set the factors at predetermined levels, run production, and take one quality measurement. They reset the equipment, run production, and take one quality measurement. In random order, the operators run each combination of factor settings five times, taking one measurement at each run.

In each experiment, five measurements are taken at each combination of factor settings. In the first experiment, the five measurements are taken during the same run; in the second experiment, the five measurements are taken in different runs. The variability between measurements taken at the same factor settings tends to be greater for replicates than for repeats because the machines are reset before each run, adding more variability to the process.
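A small simulation can make the variance point explicit. It assumes a simple additive model with an invented within-run noise term and an invented run-to-run "reset" term; repeat measurements share one run effect, while each replicate run draws its own:

```python
import numpy as np

rng = np.random.default_rng(0)
within_run_sd = 0.5    # measurement-to-measurement noise inside a single run
between_run_sd = 1.0   # extra variability added each time the equipment is reset
n_sim = 10_000

sd_repeats, sd_replicates = [], []
for _ in range(n_sim):
    # Repeats: one run, five measurements; the run effect is shared by all five
    run_effect = rng.normal(0, between_run_sd)
    repeats = 10 + run_effect + rng.normal(0, within_run_sd, size=5)
    # Replicates: five separate runs, one measurement each; every run gets its own effect
    replicates = 10 + rng.normal(0, between_run_sd, size=5) + rng.normal(0, within_run_sd, size=5)
    sd_repeats.append(repeats.std(ddof=1))
    sd_replicates.append(replicates.std(ddof=1))

print("average SD across repeats:   ", round(float(np.mean(sd_repeats)), 2))    # close to within_run_sd
print("average SD across replicates:", round(float(np.mean(sd_replicates)), 2)) # close to sqrt(within_run_sd**2 + between_run_sd**2)
```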


The Happy Scientist

What is Science: Repeat and Replicate

In the scientific process, we should not rely on the results of a single test. Instead, we should perform the test over and over. Why? If it works once, shouldn't it work the same way every time? Yes, it should, so if we repeat the experiment and get a different result, then we know that there is something about the test that we are not considering.


In studying the processes of science, you will often run into two words that seem similar: repetition and replication.

Sometimes it is a matter of random chance, as in the case of flipping a coin. Just because it comes up heads the first time does not mean that it will always come up heads. By repeating the experiment over and over, we can see if our result really supports our hypothesis (What is a Hypothesis?), or if it was just random chance.

Sometimes the result might be due to some variable that you have not recognized. In our example of flipping a coin, the individual's technique for flipping the coin might influence the results. To take that into consideration, we repeat the experiment over and over with different people, looking closely for any results that don't fit into the idea we are testing.

Results that don't fit are important! Figuring out why they do not fit our hypothesis can give us an opportunity to learn new things, and get a better understanding of the idea we are testing.

Replication

Once we have repeated our testing over and over, and think we understand the results, then it is time for replication. That means getting other scientists to perform the same tests, to see whether they get the same results. As with repetition, the most important things to watch for are results that don't fit our hypothesis, and for the same reason. Those different results give us a chance to discover more about our idea. The different results may be because the person replicating our tests did something different, but they also might be because that person noticed something that we missed.

What if you are wrong?

If we did miss something, it is OK, as long as we performed our tests honestly and scientifically. Science is not about proving that "I am right!" Instead, it is a process for trying to learn more about the universe and how it works. It is usually a group effort, with each scientist adding her own perspective to the idea, giving us a better understanding and often raising new questions to explore.


Types of Replicates: Technical vs. Biological

Tuesday, 18 February, 2020

“Authors must state the number of independent samples (biological replicates) and the number of replicate samples (technical replicates) and report how many times each experiment was replicated.” [1]  

When it comes to quantification, biological and technical replicates are key to generating accurate, reliable results. While they both offer researchers valuable data, each answers distinct questions about data reproducibility. Let's take a moment to explore the differences between technical and biological replicates.

What Are Technical Replicates?

Technical replicates are repeated measurements of the same sample that demonstrate the variability of the protocol. Technical replicates are important because they address the reproducibility of the assay or technique; however, they do not address the reproducibility of the effect or event that you are studying. Rather, they indicate whether your measurements are scientifically robust or noisy and how large the measured effect must be in order to stand out above the background noise. [2] Examples may include loading multiple lanes with each sample on the same blot, running multiple blots in parallel, or repeating the blot with the same samples on different days.


When technical replicates are highly variable, it is more difficult to separate the observed effect from the assay variation. You may need to identify and reduce sources of error in your protocol to increase the precision of your assay. Technical replicates do not address the biological relevance of the results.
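One common way to summarize that protocol-level noise is the coefficient of variation (CV) across technical replicates; the band intensities below are invented for illustration, not taken from any real blot:

```python
import statistics

# Hypothetical band intensities for one sample loaded in three lanes (technical replicates)
technical_replicates = [10_250, 9_870, 10_540]

mean = statistics.mean(technical_replicates)
sd = statistics.stdev(technical_replicates)
cv_percent = 100 * sd / mean

print(f"mean = {mean:.0f}, SD = {sd:.0f}, CV = {cv_percent:.1f}%")
# A large CV means assay noise may swamp the biological effect you hope to detect.
```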

What Are Biological Replicates?

Biological replicates are parallel measurements of biologically distinct samples that capture random biological variation, which can be a subject of study or a source of noise itself. [3] Biological replicates are important because they address how widely your experimental results can be generalized. They indicate if an experimental effect is sustainable under a different set of biological variables.

For example, common biological replicates include repeating a particular assay with independently generated samples or samples derived from various cell types, tissue types, or organisms to see if similar results can be observed. Examples include analysis of samples from multiple mice rather than a single mouse, or from multiple batches of independently cultured and treated cells.


To demonstrate the same effect in a different experimental context, the experiment might be repeated in multiple cell lines, in related cell types or tissues, or with other biological systems. An appropriate replication strategy should be developed for each experimental context. Several recent papers discuss considerations for choosing technical and biological replicates. [1,2,3]
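A typical analysis treats the biological replicate as the experimental unit: technical replicates are averaged within each biological sample first, and group statistics are then computed across those per-sample means. The sketch below uses invented normalized signals purely to show that structure; it is not a prescribed protocol:

```python
import statistics

# Hypothetical normalized signals: 3 mice per group (biological replicates),
# each measured in duplicate lanes (technical replicates)
control = {"mouse_1": [1.02, 0.95], "mouse_2": [1.10, 1.07], "mouse_3": [0.93, 0.99]}
treated = {"mouse_4": [1.61, 1.55], "mouse_5": [1.48, 1.52], "mouse_6": [1.70, 1.66]}

def biological_means(group):
    # Collapse technical replicates so each animal contributes one value
    # (n = number of animals, not number of lanes)
    return [statistics.mean(values) for values in group.values()]

ctrl_means, trt_means = biological_means(control), biological_means(treated)
print("control biological means:", [round(m, 2) for m in ctrl_means])
print("treated biological means:", [round(m, 2) for m in trt_means])
print("fold change of group means:",
      round(statistics.mean(trt_means) / statistics.mean(ctrl_means), 2))
```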

For a helpful guide in choosing and incorporating the right technical and biological replicates for your experiment, check out the Quantitative Western Blot Analysis with Replicate Samples protocol.

Biological vs. Technical Replicates

Proper Western blot quantification requires a keen understanding and careful application of both biological and technical replicate best practices. Here are some additional resources that you can use as you plan your quantitative Western blot strategy.

  • Determining the Linear Range for Quantitative Western Blot Detection provides guidelines for characterizing the linear range of detection for your system.
  • Housekeeping Protein Validation Protocol walks you through the steps of how to validate that the expression of an HKP is constant across all samples and unaffected by the specific experimental context and conditions. If expression varies, then the HKP cannot be used for normalization.
  • Housekeeping Protein Normalization Protocol shows you how to use housekeeping proteins for Western blot normalization, as long as you have validated that their expression does not change under your experimental conditions.
  • Revert™ 700 Total Protein Stain Normalization Protocol describes how to use Revert 700 Total Protein Stain, which is fast becoming the gold standard for normalizing protein loading, for Western blot normalization and quantitative analysis.
  • Pan/Phospho Analysis for Western Blot Normalization provides guidelines on how to use total and post-translationally modified proteins for Western blot normalization.

References:

  • [1] Instructions for Authors. The Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology. Web. 31 July 2017.
  • [2] Naegle K, Gough NR, Yaffe MB. Criteria for biological reproducibility: what does “n” mean? Sci Signal. 8 (371): fs7 (2015).
  • [3] Blainey P, Krzywinski M, and Altman N. (2014) Points of Significance: Replication. Nature Methods 11(9): 879-880. doi:10.1038/nmeth.30
  • [4] Collecting and Presenting Data. The Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology. Web. 9 May 2018.


National Academies Press: OpenBook

Reproducibility and Replicability in Science (2019)

Chapter 5: Replicability

Replication is one of the key ways scientists build confidence in the scientific merit of results. When the result from one study is found to be consistent by another study, it is more likely to represent a reliable claim to new knowledge. As Popper (2005 , p. 23) wrote (using “reproducibility” in its generic sense):

We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

However, a successful replication does not guarantee that the original scientific results of a study were correct, nor does a single failed replication conclusively refute the original claims. A failure to replicate previous results can be due to any number of factors, including the discovery of an unknown effect, inherent variability in the system, inability to control complex variables, substandard research practices, and, quite simply, chance. The nature of the problem under study and the prior likelihoods of possible results in the study, the type of measurement instruments and research design selected, and the novelty of the area of study and therefore lack of established methods of inquiry can also contribute to non-replicability. Because of the complicated relationship between replicability and its variety of sources, the validity of scientific results should be considered in the context of an entire body of evidence, rather than an individual study or an individual replication. Moreover, replication may be a matter of degree, rather than a binary result of “success” or “failure.” 1 We explain in Chapter 7 how research synthesis, especially meta-analysis, can be used to evaluate the evidence on a given question.

ASSESSING REPLICABILITY

How does one determine the extent to which a replication attempt has been successful? When researchers investigate the same scientific question using the same methods and similar tools, the results are not likely to be identical—unlike in computational reproducibility in which bitwise agreement between two results can be expected (see Chapter 4 ). We repeat our definition of replicability, with emphasis added: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Determining consistency between two different results or inferences can be approached in a number of ways ( Simonsohn, 2015 ; Verhagen and Wagenmakers, 2014 ). Even if one considers only quantitative criteria for determining whether two results qualify as consistent, there is variability across disciplines ( Zwaan et al., 2018 ; Plant and Hanisch, 2018 ). The Royal Netherlands Academy of Arts and Sciences (2018 , p. 20) concluded that “it is impossible to identify a single, universal approach to determining [replicability].” As noted in Chapter 2 , different scientific disciplines are distinguished in part by the types of tools, methods, and techniques used to answer questions specific to the discipline, and these differences include how replicability is assessed.

___________________

1 See, for example, the cancer biology project in Table 5-1 in this chapter.

Acknowledging the different approaches to assessing replicability across scientific disciplines, however, we emphasize eight core characteristics and principles:

  • Attempts at replication of previous results are conducted following the methods and using similar equipment and analyses as described in the original study or under sufficiently similar conditions ( Cova et al., 2018 ). 2 Yet regardless of how similar the replication study is, no second event can exactly repeat a previous event.
  • The concept of replication between two results is inseparable from uncertainty, as is also the case for reproducibility (as discussed in Chapter 4 ).
  • Any determination of replication (between two results) needs to take account of both proximity (i.e., the closeness of one result to the other, such as the closeness of the mean values) and uncertainty (i.e., variability in the measures of the results).
  • To assess replicability, one must first specify exactly what attribute of a previous result is of interest. For example, is only the direction of a possible effect of interest? Is the magnitude of effect of interest? Is surpassing a specified threshold of magnitude of interest? With the attribute of interest specified, one can then ask whether two results fall within or outside the bounds of “proximity-uncertainty” that would qualify as replicated results.
  • Depending on the selected criteria (e.g., measure, attribute), assessments of a set of attempted replications could appear quite divergent. 3
  • A judgment that “Result A is replicated by Result B” must be identical to the judgment that “Result B is replicated by Result A.” There must be a symmetry in the judgment of replication; otherwise, internal contradictions are inevitable.
  • There could be advantages to inverting the question from “Does Result A replicate Result B (given their proximity and uncertainty)?” to “Are Results A and B sufficiently divergent (given their proximity and uncertainty) so as to qualify as a non-replication?” It may be advantageous, in assessing degrees of replicability, to define a relatively high threshold of similarity that qualifies as “replication,” a relatively low threshold of similarity that qualifies as “non-replication,” and an intermediate zone between the two thresholds that is considered “indeterminate.” If a second study has low power and wide uncertainties, it may be unable to produce any but indeterminate results.
  • While a number of different standards for replicability/non-replicability may be justifiable, depending on the attributes of interest, a standard of “repeated statistical significance” has many limitations because the level of statistical significance is an arbitrary threshold ( Amrhein et al., 2019a ; Boos and Stefanski, 2011 ; Goodman, 1992 ; Lazzeroni et al., 2016 ). For example, one study may yield a p-value of 0.049 (declared significant at the p ≤ 0.05 level) and a second study a p-value of 0.051 (declared nonsignificant by the same threshold), and the two studies are therefore said not to have replicated. However, if the second study had yielded a p-value of 0.03, the reviewer would say it had successfully replicated the first study, even though that result could diverge more sharply (by proximity and uncertainty) from the original study than in the first comparison. Rather than focus on an arbitrary threshold such as statistical significance, it would be more revealing to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (or uncertainties), and additional metrics tailored to the subject matter.

2 Cova et al. (2018, fn. 3) discuss the challenge of defining sufficiently similar as well as the interpretation of the results:

In practice, it can be hard to determine whether the ‘sufficiently similar’ criterion has actually been fulfilled by the replication attempt, whether in its methods or in its results ( Nakagawa and Parker, 2015 ). It can therefore be challenging to interpret the results of replication studies, no matter which way these results turn out ( Collins, 1975 ; Earp and Trafimow, 2015 ; Maxwell et al., 2015 ).

3 See Table 5-1 for an example of this in the reviews of a psychology replication study by Open Science Collaboration (2015) and Patil et al. (2016).

The final point above is reinforced by a recent special edition of the American Statistician in which the use of a statistical significance threshold in reporting is strongly discouraged due to overuse and wide misinterpretation ( Wasserstein et al., 2019 ). A figure from Amrhein et al. (2019b) also demonstrates this point, as shown in Figure 5-1.

One concern voiced by some researchers about using a proximity-uncertainty attribute to assess replicability is that such an assessment favors studies with large uncertainties; the potential consequence is that many researchers would choose to perform low-power studies to increase the replicability chances ( Cova et al., 2018 ). While two results with large uncertainties and within proximity, such that the uncertainties overlap with each other, may be consistent with replication, the large uncertainties indicate that not much confidence can be placed in that conclusion.
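As one hedged illustration of a proximity-uncertainty comparison (similar in spirit to the confidence-interval criteria used by some of the projects in Table 5-1, not a method endorsed by the committee), a symmetric check might ask whether each estimate falls inside the other result's interval; the numbers below are invented:

```python
from typing import NamedTuple

class Result(NamedTuple):
    estimate: float   # e.g., an effect size
    ci_low: float     # lower bound of the 95% confidence interval
    ci_high: float    # upper bound of the 95% confidence interval

def consistent(a: Result, b: Result) -> bool:
    """Symmetric proximity-uncertainty check: each estimate lies within the
    other's interval. Symmetry matters: A replicates B iff B replicates A."""
    return (b.ci_low <= a.estimate <= b.ci_high) and (a.ci_low <= b.estimate <= a.ci_high)

original    = Result(estimate=0.40, ci_low=0.10, ci_high=0.70)   # invented numbers
replication = Result(estimate=0.25, ci_low=0.05, ci_high=0.45)

print("consistent (proximity + uncertainty):", consistent(original, replication))
```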


CONCLUSION 5-1: Different types of scientific studies lead to different or multiple criteria for determining a successful replication. The choice of criteria can affect the apparent rate of non-replication, and that choice calls for judgment and explanation.

CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained “statistical significance,” that is, when the p -values in both studies have exceeded a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (uncertainties), and additional metrics tailored to the subject matter.

THE EXTENT OF NON-REPLICABILITY

The committee was asked to assess what is known and, if necessary, identify areas that may need more information to ascertain the extent of non-replicability in scientific and engineering research. The committee examined current efforts to assess the extent of non-replicability within several fields, reviewed literature on the topic, and heard from expert panels during its public meetings. We also drew on the previous work of committee members and other experts in the field of replicability of research.

Some efforts to assess the extent of non-replicability in scientific research directly measure rates of replication, while others examine indirect measures to infer the extent of non-replication. Approaches to assessing non-replicability rates include

  • direct and indirect assessments of replicability;
  • perspectives of researchers who have studied replicability;
  • surveys of researchers; and
  • retraction trends.

This section discusses each of these lines of evidence.

Assessments of Replicability

The most direct method to assess replicability is to perform a study following the original methods of a previous study and to compare the new results to the original ones. Some high-profile replication efforts in recent years include studies by Amgen, which showed low replication rates in biomedical research ( Begley and Ellis, 2012 ), and work by the Center for Open Science on psychology ( Open Science Collaboration, 2015 ), cancer research ( Nosek and Errington, 2017 ), and social science ( Camerer et al., 2018 ). In these examples, a set of studies was selected and a single replication attempt was made to confirm results of each previous study, or one-to-one comparisons were made. In other replication studies, teams of researchers performed multiple replication attempts on a single original result, or many-to-one comparisons (see e.g., Klein et al., 2014 ; Hagger et al., 2016 ; and Cova et al., 2018 in Table 5-1 ).

Other measures of replicability include assessments that can provide indicators of bias, errors, and outliers, including, for example, computational data checks of reported numbers and comparison of reported values against a database of previously reported values. Such assessments can identify data that are outliers to previous measurements and may signal the need for additional investigation to understand the discrepancy. 4 Table 5-1 summarizes the direct and indirect replication studies assembled by the committee. Other sources of non-replicability are discussed later in this chapter in the Sources of Non-Replicability section.

4 There is risk of missing a new discovery by rejecting data outliers without further investigation.

Many direct replication studies are not reported as such. Replication—especially of surprising results or those that could have a major impact—occurs in science often without being labelled as a replication. Many scientific fields conduct reviews of articles on a specific topic—especially on new topics or topics likely to have a major impact—to assess the available data and determine which measurements and results are rigorous (see Chapter 7 ). Therefore, replicability studies included as part of the scientific literature but not cited as such add to the difficulty in assessing the extent of replication and non-replication.

One example of this phenomenon relates to research on hydrogen storage capacity. The U.S. Department of Energy (DOE) issued a target storage capacity in the mid-1990s. One group using carbon nanotubes reported surprisingly high values that met DOE’s target ( Hynek et al., 1997 ); other researchers who attempted to replicate these results could not do so. At the same time, other researchers were also reporting high values of hydrogen capacity in other experiments. In 2003, an article reviewed previous studies of hydrogen storage values and reported new research results, which were later replicated ( Broom and Hirscher, 2016 ). None of these studies was explicitly called an attempt at replication.

Based on the content of the collected studies in Table 5-1 , one can observe that the

  • majority of the studies are in the social and behavioral sciences (including economics) or in biomedical fields, and
  • methods of assessing replicability are inconsistent and the replicability percentages depend strongly on the methods used.

The replication studies such as those shown in Table 5-1 are not necessarily indicative of the actual rate of non-replicability across science for a number of reasons: the studies to be replicated were not randomly chosen, the replications had methodological shortcomings, many replication studies are not reported as such, and the reported replication studies found widely varying rates of non-replication ( Gilbert et al., 2016 ). At the same time, replication studies often provide more and better-quality evidence than most original studies alone, and they highlight such methodological features as high precision or statistical power, preregistration, and multi-site collaboration ( Nosek, 2016 ). Some would argue that focusing on replication of a single study as a way to improve the efficiency of science is ill-placed. Rather, reviews of cumulative evidence on a subject, to gauge both the overall effect size and generalizability, may be more useful ( Goodman, 2018 ; and see Chapter 7 ).

Apart from specific efforts to replicate others’ studies, investigators will typically confirm their own results, as in a laboratory experiment, prior to publication.

TABLE 5-1 Examples of Replication Studies

  • Experimental Philosophy ( ). Description: A group of 20 research teams performed replication studies of 40 experimental philosophy studies published between 2003 and 2015. Results: 70% of the 40 studies were replicated by comparing the original effect size to the confidence interval (CI) of the replication. Type of assessment: Direct.
  • Behavioral Science, Personality Traits Linked to Life Outcomes ( ). Description: Performed replications of 78 previously published associations between the Big Five personality traits and consequential life outcomes. Results: 87% of the replication attempts were statistically significant in the expected direction, and effects were typically 77% as strong as the corresponding original effects. Type of assessment: Direct.
  • Behavioral Science, Ego-Depletion Effect ( ). Description: Multiple laboratories (23 in total) conducted replications of a standardized ego-depletion protocol based on a sequential-task paradigm. Results: Meta-analysis of the studies revealed that the size of the ego-depletion effect was small, with a 95% CI that encompassed zero (d = 0.04, 95% CI [−0.07, 0.15]).
  • General Biology, Preclinical Animal Studies ( ). Description: Attempt by researchers from Bayer HealthCare to validate data on potential drug targets obtained in 67 projects by copying models exactly or by adapting them to internal needs. Results: Published data were completely in line with the results of the validation studies in 20%-25% of cases. Type of assessment: Direct.
  • Oncology, Preclinical Studies ( ). Description: Attempt by an Amgen team to reproduce the results of 53 “landmark” studies. Results: Scientific results were confirmed in 11% of the studies. Type of assessment: Direct.
  • Genetics, Preclinical Studies ( ). Description: Replication of data analyses provided in 18 articles on microarray-based gene expression studies. Results: Of the 18 studies, 2 analyses (11%) were replicated; 6 were partially replicated or showed some discrepancies in results; and 10 could not be replicated. Type of assessment: Direct.
  • Experimental Psychology ( ). Description: Replication of 13 psychological phenomena across 36 independent samples. Results: 77% of phenomena were replicated consistently. Type of assessment: Direct.
  • Experimental Psychology, Many Labs 2 ( ). Description: Replication of 28 classic and contemporary published studies. Results: 54% of replications produced a statistically significant effect in the same direction as the original study, 75% yielded effect sizes smaller than the original ones, and 25% yielded larger effect sizes than the original ones. Type of assessment: Direct.
  • Experimental Psychology ( ). Description: Attempt to independently replicate selected results from 100 studies in psychology. Results: 36% of the replication studies produced significant results, compared to 97% of the original studies; the mean effect sizes were halved. Type of assessment: Direct.
  • Experimental Psychology ( ). Description: Using reported data from the replication study in psychology, reanalyzed the results. Results: 77% of the studies replicated by comparing the original effect size to an estimated 95% CI of the replication. Type of assessment: Direct.
  • Experimental Psychology ( ). Description: Attempt to replicate 21 systematically selected experimental studies in the social sciences published in 2010-2015. Results: Found a significant effect in the same direction as the original study for 62% (13 of 21) of the studies, and the effect size of the replications was on average about 50% of the original effect size. Type of assessment: Direct.
  • Empirical Economics ( ). Description: 2-year study that collected programs and data from authors and attempted to replicate their published results on empirical economic research. Results: Two of nine replications were successful, three were “near” successful, and four were unsuccessful; findings suggest that inadvertent errors in published empirical articles are commonplace rather than a rare occurrence. Type of assessment: Direct.
  • Economics ( ). Description: Progress report on the number of journals with data sharing requirements and an assessment of 167 studies. Results: 10 journals explicitly note they publish replications; of 167 published replication studies, approximately 66% were unable to confirm the original results, and 12% disconfirmed at least one major result of the original study while confirming others. Type of assessment: N/A.
  • Economics ( ). Description: An effort to replicate 18 studies published from 2011-2014. Results: A significant effect in the same direction as the original study was found for 11 replications (61%); on average, the replicated effect size was 66% of the original. Type of assessment: Direct.
  • Chemistry ( ; ). Description: Collaboration with the National Institute of Standards and Technology (NIST) to check new data against the NIST database, 13,000 measurements. Results: 27% of papers reporting properties of adsorption had data that were outliers; 20% of papers reporting carbon dioxide isotherms had outliers. Type of assessment: Indirect.
  • Chemistry ( ). Description: Collaboration with NIST, Thermodynamics Research Center (TRC) databases, prepublication check of solubility, viscosity, critical temperature, and vapor pressure. Results: 33% of experiments had data problems, such as uncertainties that were too small or reported values outside of TRC database distributions. Type of assessment: Indirect.
  • Biology, Reproducibility Project: Cancer Biology. Description: Large-scale replication project to replicate key results in 29 cancer papers published in high-impact journals. Results: The first five articles have been published; two replicated important parts of the original papers, one did not replicate, and two were uninterpretable. Type of assessment: Direct.
  • Psychology, Statistical Checks ( ). Description: Statcheck tool used to test statistical values within psychology articles from 1985-2013. Results: 49.6% of the articles with null hypothesis statistical test (NHST) results contained at least one inconsistency (8,273 of the 16,695 articles), and 12.9% (2,150) of the articles with NHST results contained at least one gross inconsistency. Type of assessment: Indirect.
  • Engineering, Computational Fluid Dynamics ( ). Description: Full replication studies of previously published results on bluff-body aerodynamics, using four different computational methods. Results: Replication of the main result was achieved in three out of four of the computational efforts. Type of assessment: Direct.
  • Psychology, Many Labs 3 ( ). Description: Attempt to replicate 10 psychology studies in one online session. Results: 3 of 10 studies replicated at p < 0.05. Type of assessment: Direct.
  • Psychology ( ). Description: Argued that one of the failed replications in Ebersole et al. was due to changes in the procedure; participants were randomly assigned to a version closer to the original or to Ebersole et al.’s version. Results: The original study replicated when the original procedures were followed more closely, but not when the Ebersole et al. procedures were used. Type of assessment: Direct.
  • Psychology ( ). Description: 17 different labs attempted to replicate one study on facial feedback. Results: None of the studies replicated the result at p < 0.05. Type of assessment: Direct.
  • Psychology ( ). Description: Pointed out that all of the studies in the replication project changed the procedure by videotaping participants; conducted a replication in which participants were randomly assigned to be videotaped or not. Results: The original study was replicated when the original procedure was followed (p = 0.01); the original study was not replicated when the video camera was present (p = 0.85). Type of assessment: Direct.
  • Psychology ( ). Description: 31 labs attempted to replicate a study by Schooler and Engstler-Schooler (1990). Results: Replicated the original study; the effect size was much larger when the original study was replicated more faithfully (the first set of replications inadvertently introduced a change in the procedure). Type of assessment: Direct.
NOTES: Some of the studies in this table also appear in Table 4-1 as they evaluated both reproducibility and replicability. N/A = not applicable.

a From Cova et al. (2018 , p. 14): “For studies reporting statistically significant results, we treated as successful replications for which the replication 95 percent CI [confidence interval] was not lower than the original effect size. For studies reporting null results, we treated as successful replications for which original effect sizes fell inside the bounds of the 95 percent CI.”

b From Soto (2019 , p. 7, fn. 1): “Previous large-scale replication projects have typically treated the individual study as the primary unit of analysis. Because personality-outcome studies often examine multiple trait-outcome associations, we selected the individual association as the most appropriate unit of analysis for estimating replicability in this literature.”

More generally, independent investigators may replicate prior results of others before conducting, or in the course of conducting, a study to extend the original work. These types of replications are not usually published as separate replication studies.

Perspectives of Researchers Who Have Studied Replicability

Several experts who have studied replicability within and across fields of science and engineering provided their perspectives to the committee. Brian Nosek, cofounder and director of the Center for Open Science, said there was “not enough information to provide an estimate with any certainty across fields and even within individual fields.” In a recent paper discussing scientific progress and problems, Richard Shiffrin, professor of psychology and brain sciences at Indiana University, and colleagues argued that there are “no feasible methods to produce a quantitative metric, either across science or within the field” to measure the progress of science ( Shiffrin et al., 2018 , p. 2632). Skip Lupia, now serving as head of the Directorate for Social, Behavioral, and Economic Sciences at the National Science Foundation, said that there is not sufficient information to be able to definitively answer the extent of non-reproducibility and non-replicability, but there is evidence of p-hacking and publication bias (see below), which are problems. Steven Goodman, the codirector of the Meta-Research Innovation Center at Stanford University (METRICS), suggested that the focus ought not be on the rate of non-replication of individual studies, but rather on cumulative evidence provided by all studies and convergence to the truth. He suggested the proper question is “How efficient is the scientific enterprise in generating reliable knowledge, what affects that reliability, and how can we improve it?”

Surveys of scientists about issues of replicability or on scientific methods are indirect measures of non-replicability. For example, Nature published the results of a survey in 2016 in an article titled “1,500 Scientists Lift the Lid on Reproducibility” ( Baker, 2016 ). 5 This article reported that a large percentage of researchers who responded to an online survey believe that replicability is a problem, and it has been widely cited by researchers studying subjects ranging from cardiovascular disease to crystal structures ( Warner et al., 2018 ; Ziletti et al., 2018 ). Surveys and studies have also assessed the prevalence of specific problematic research practices, such as a 2018 survey about questionable research practices in ecology and evolution ( Fraser et al., 2018 ). However, many of these surveys rely on poorly defined sampling frames to identify populations of scientists and do not use probability sampling techniques. The fact that nonprobability samples “rely mostly on people . . . whose selection probabilities are unknown [makes it] difficult to estimate how representative they are of the [target] population” ( Dillman, Smyth, and Christian, 2014 , pp. 70, 92). In fact, we know that people with a particular interest in or concern about a topic, such as replicability and reproducibility, are more likely to respond to surveys on the topic ( Brehm, 1993 ). As a result, we caution against using surveys based on nonprobability samples as the basis of any conclusion about the extent of non-replicability in science.

5 Nature uses the word “reproducibility” to refer to what we call “replicability.”

High-quality researcher surveys are expensive and pose significant challenges, including constructing exhaustive sampling frames, reaching adequate response rates, and minimizing other nonresponse biases that might differentially affect respondents at different career stages or in different professional environments or fields of study ( Corley et al., 2011 ; Peters et al., 2008 ; Scheufele et al., 2009 ). As a result, the attempts to date to gather input on topics related to replicability and reproducibility from larger numbers of scientists ( Baker, 2016 ; Boulbes et al., 2018 ) have relied on convenience samples and other methodological choices that limit the conclusions that can be made about attitudes among the larger scientific community or even for specific subfields based on the data from such surveys. More methodologically sound surveys following guidelines on adoption of open science practices and other replicability-related issues are beginning to emerge. 6 See Appendix E for a discussion of conducting reliable surveys of scientists.

Retraction Trends

Retractions of published articles may be related to their non-replicability. As noted in a recent study on retraction trends ( Brainard, 2018 , p. 392), “Overall, nearly 40% of retraction notices did not mention fraud or other kinds of misconduct. Instead, the papers were retracted because of errors, problems with reproducibility [or replicability], and other issues.” Overall, about one-half of all retractions appear to involve fabrication, falsification, or plagiarism. Journal article retractions in biomedicine increased from 50-60 per year in the mid-2000s to 600-700 per year by the mid-2010s ( National Library of Medicine, 2018 ), and this increase attracted much commentary and analysis (see, e.g., Grieneisen and Zhang, 2012 ).

6 See https://cega.berkeley.edu/resource/the-state-of-social-science-betsy-levy-paluck-bitssannual-meeting-2018 .

A recent comprehensive review of an extensive database of 18,000 retracted papers dating back to the 1970s found that while the number of retractions has grown, the rate of increase has slowed; approximately 4 of every 10,000 papers are now retracted ( Brainard, 2018 ). Overall, the number of journals that report retractions has grown from 44 journals in 1997 to 488 journals in 2016; however, the average number of retractions per journal has remained essentially flat since 1997.

These data suggest that more journals are attending to the problem of articles that need to be retracted rather than a growing problem in any one discipline of science. Fewer than 2 percent of authors in the database account for more than one-quarter of the retracted articles, and the retractions of these frequent offenders are usually based on fraud rather than errors that lead to non-replicability. The Institute of Electrical and Electronics Engineers alone has retracted more than 7,000 abstracts from conferences that took place between 2009 and 2011, most of which had authors based in China ( McCook, 2018 ).

The body of evidence on the extent of non-replicability gathered by the committee is not a comprehensive assessment across all fields of science nor even within any given field of study. Such a comprehensive effort would be daunting due to the vast amount of research published each year and the diversity of scientific and engineering fields. Among studies of replication that are available, there is no uniform approach across scientific fields to gauge replication between two studies. The experts who contributed their perspectives to the committee all question the feasibility of such a science-wide assessment of non-replicability.

While the evidence base assessed by the committee may not be sufficient to permit a firm quantitative answer on the scope of non-replicability, it does support several findings and a conclusion.

FINDING 5-1: There is an uneven level of awareness of issues related to replicability across fields and even within fields of science and engineering.

FINDING 5-2: Efforts to replicate studies aimed at discerning the effect of an intervention in a study population may find a similar direction of effect, but a different (often smaller) size of effect.

FINDING 5-3: Studies that directly measure replicability take substantial time and resources.

FINDING 5-4: Comparing results across replication studies may be compromised because different replication studies may test different study attributes and rely on different standards and measures for a successful replication.

FINDING 5-5: Replication studies in the natural and clinical sciences (general biology, genetics, oncology, chemistry) and social sciences (including economics and psychology) report frequencies of replication ranging from fewer than one out of five studies to more than three out of four studies.

CONCLUSION 5-3: Because many scientists routinely conduct replication tests as part of a follow-on work and do not report replication results separately, the evidence base of non-replicability across all science and engineering research is incomplete.

SOURCES OF NON-REPLICABILITY

Non-replicability can arise from a number of sources. In some cases, non-replicability arises from the inherent characteristics of the systems under study. In others, decisions made by a researcher or researchers in study execution that reasonably differ from the original study, such as judgment calls on data cleaning or the selection of parameter values within a model, may also result in non-replication. Other sources of non-replicability arise from conscious or unconscious bias in reporting, mistakes and errors (including misuse of statistical methods), and problems in study design, execution, or interpretation in either the original study or the replication attempt. In many instances, non-replication between two results could be due to a combination of multiple sources, but it is not generally possible to identify the source without careful examination of the two studies. Below, we review these sources of non-replicability and discuss how researchers’ choices can affect each. Unless otherwise noted, the discussion below focuses on the non-replicability between two results (i.e., a one-to-one comparison) when assessed using proximity and uncertainty of both results.

Non-Replicability That Is Potentially Helpful to Science

Non-replicability is a normal part of the scientific process and can be due to the intrinsic variation and complexity of nature, the scope of current scientific knowledge, and the limits of current technologies. Highly surprising and unexpected results are often not replicated by other researchers. In other instances, a second researcher or research team may purposefully make decisions that lead to differences in parts of the study. As long as these differences are reported with the final results, these may be reasonable actions to take yet result in non-replication. In scientific reporting, uncertainties within the study (such as the uncertainty within measurements, the potential interactions between parameters, and the variability of the system under study) are estimated, assessed, characterized, and accounted for through uncertainty and probability analysis. When uncertainties are unknown and not accounted for, this can also lead to non-replicability. In these instances, non-replicability of results is a normal consequence of studying complex systems with imperfect knowledge and tools. When non-replication of results due to sources such as those listed above is investigated and resolved, it can lead to new insights, better uncertainty characterization, and increased knowledge about the systems under study and the methods used to study them. See Box 5-1 for examples of how investigations of non-replication have been helpful to increasing knowledge.

The susceptibility of any line of scientific inquiry to sources of non-replicability depends on many factors, including factors inherent to the system under study, such as the

  • complexity of the system under study;
  • understanding of the number and relations among variables within the system under study;
  • ability to control the variables;
  • levels of noise within the system (or signal to noise ratios);
  • mismatch between the scale of the phenomena and the scale at which they can be measured;
  • stability across time and space of the underlying principles;
  • fidelity of the available measures to the underlying system under study (e.g., direct or indirect measurements); and
  • prior probability (pre-experimental plausibility) of the scientific hypothesis.

Studies that pursue lines of inquiry that are able to better estimate and analyze the uncertainties associated with the variables in the system and control the methods that will be used to conduct the experiment are more replicable. On the other end of the spectrum, studies that are more prone to non-replication often involve indirect measurement of very complex systems (e.g., human behavior) and require statistical analysis to draw conclusions. To illustrate how these characteristics can lead to results that are more or less likely to replicate, consider the attributes of complexity and controllability. The complexity and controllability of a system contribute to the underlying variance of the distribution of expected results and thus the likelihood of non-replication. 7

7 Complexity and controllability in an experimental system affect its susceptibility to non-replicability independently from the way prior odds, power, or p-values associated with hypothesis testing affect the likelihood that an experimental result represents the true state of the world.

The systems that scientists study vary in their complexity. Although all systems have some degree of intrinsic or random variability, some systems are less well understood, and their intrinsic variability is more difficult to assess or estimate. Complex systems tend to have numerous interacting components (e.g., cell biology, disease outbreaks, friction coefficient between two unknown surfaces, urban environments, complex organizations and populations, and human health). Interrelations and interactions among multiple components cannot always be predicted and neither can the resulting effects on the experimental outcomes, so an initial estimate of uncertainty may be an educated guess.

Systems under study also vary in their controllability. If the variables within a system can be known, characterized, and controlled, research on such a system tends to produce more replicable results. For example, in social sciences, a person’s response to a stimulus (e.g., a person’s behavior when placed in a specific situation) depends on a large number of variables—including social context, biological and psychological traits, verbal and nonverbal cues from researchers—all of which are difficult or impossible to control completely. In contrast, a physical object’s response to a physical stimulus (e.g., a liquid’s response to a rise in temperature) depends almost entirely on variables that can either be controlled or adjusted for, such as temperature, air pressure, and elevation. Because of these differences, one expects that studies that are conducted in the relatively more controllable systems will replicate with greater frequency than those that are in less controllable systems. Scientists seek to control the variables relevant to the system under study and the nature of the inquiry, but when these variables are more difficult to control, the likelihood of non-replicability will be higher. Figure 5-2 illustrates the combinations of complexity and controllability.

Many scientific fields have studies that span these quadrants, as demonstrated by the following examples from engineering, physics, and psychology. Veronique Kiermer, PLOS executive editor, in her briefing to the committee noted: “There is a clear correlation between the complexity of the design, the complexity of measurement tools, and the signal to noise ratio that we are trying to measure.” (See also Goodman et al., 2016 , on the complexity of statistical and inferential methods.)

Engineering. Aluminum-lithium alloys were developed by engineers because of their strength-to-weight ratio, primarily for use in aerospace engineering. The process of developing these alloys spans the four quadrants. Early generation of binary alloys was a simple system that showed high replicability (Quadrant A). Second-generation alloys had higher amounts of lithium and resulted in lower replicability that appeared as failures in manufacturing operations because the interactions of the elements were not understood (Quadrant C). The third-generation alloys contained less lithium and higher relative amounts of other alloying elements, which made it a more complex system but better controlled (Quadrant B), with improved replicability. The development of any alloy is subject to a highly controlled environment. Unknown aspects of the system, such as interactions among the components, cannot be controlled initially and can lead to failures. Once these are understood, conditions can be modified (e.g., heat treatment) to bring about higher replicability.

Physics. In physics, measurement of the electronic band gap of semiconducting and conducting materials using scanning tunneling microscopy is a highly controlled, simple system (Quadrant A). The searches for the Higgs boson and gravitational waves were separate efforts, and each required the development of large, complex experimental apparatus and careful characterization of the measurement and data analysis systems (Quadrant B). Some systems, such as radiation portal monitors, require setting thresholds for alarms without knowledge of when or if a threat will ever pass through them; the variety of potential signatures is high and there is little controllability of the system during operation (Quadrant C). Finally, a simple system with little controllability is that of precisely predicting the path of a feather dropped from a given height (Quadrant D).

Psychology. In psychology, Quadrant A includes studies of basic sensory and perceptual processes that are common to all human beings, such as the Purkinje shift (i.e., a change in sensitivity of the human eye under different levels of illumination). Quadrant D includes studies of complex social behaviors that are influenced by culture and context; for example, a study of the effects of a father’s absence on children’s ability to delay gratification revealed stronger effects among younger children ( Mischel, 1961 ).

Inherent sources of non-replicability arise in every field of science, but they can vary widely depending on the specific system undergoing study. When the sources are knowable, or arise from experimental design choices, researchers need to identify and assess these sources of uncertainty insofar as they can be estimated. Researchers need also to report on steps that were intended to reduce uncertainties inherent in the study or that differ from the original study (i.e., data cleaning decisions that resulted in a different final dataset). The committee agrees with those who argue that the testing of assumptions and the characterization of the components of a study are as important to report as are the ultimate results of the study ( Plant and Hanisch, 2018 ), including studies using statistical inference and reporting p-values ( Boos and Stefanski, 2011 ). Every scientific inquiry encounters an irreducible level of uncertainty, whether this is due to random processes in the system under study, limits to researchers’ understanding of, or ability to control, that system, or limitations of the ability to measure. If researchers do not adequately consider and report these uncertainties and limitations, this can contribute to non-replicability.

RECOMMENDATION 5-1: Researchers should, as applicable to the specific study, provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. Researchers should thoughtfully communicate all recognized uncertainties and estimate or acknowledge other potential sources of uncertainty that bear on their results, including stochastic uncertainties and uncertainties in measurement, computation, knowledge, modeling, and methods of analysis.

Unhelpful Sources of Non-Replicability

Non-replicability can also be the result of human error or poor researcher choices. Shortcomings in the design, conduct, and communication of a study may all contribute to non-replicability.

These defects may arise at any point along the process of conducting research, from design and conduct to analysis and reporting, and errors may be made because the researcher was ignorant of best practices, was sloppy in carrying out research, made a simple error, or had unconscious bias toward a specific outcome. Whether arising from lack of knowledge, perverse incentives, sloppiness, or bias, these sources of non-replicability warrant continued attention because they reduce the efficiency with which science progresses, and time spent resolving non-replicability issues that are caused by these sources does not add to scientific understanding. That is, they are unhelpful in making scientific progress. We consider here a selected set of such avoidable sources of non-replication:

  • publication bias
  • misaligned incentives
  • inappropriate statistical inference
  • poor study design
  • incomplete reporting of a study

We will discuss each source in turn.

Publication Bias

Both researchers and journals want to publish new, innovative, ground-breaking research. The publication preference for statistically significant, positive results produces a biased literature through the exclusion of statistically nonsignificant results (i.e., those that do not show an effect that is sufficiently unlikely if the null hypothesis is true). As noted in Chapter 2 , there is great pressure to publish in high-impact journals and for researchers to make new discoveries. Furthermore, it may be difficult for researchers to publish even robust nonsignificant results, except in circumstances where the results contradict what has come to be an accepted positive effect. Replication studies and studies with valuable data but inconclusive results may be similarly difficult to publish. This publication bias results in a published literature that does not reflect the full range of evidence about a research topic.

One powerful example is a set of clinical studies performed on the effectiveness of tamoxifen, a drug used to treat breast cancer. In a systematic review (see Chapter 7 ) of the drug’s effectiveness, 23 clinical trials were reviewed; the statistical significance of 22 of the 23 studies did not reach the criterion of p < 0.05, yet the cumulative review of the set of studies showed a large effect (a reduction of 16% [±3] in the odds of death among women of all ages assigned to tamoxifen treatment [ Peto et al., 1988 , p. 1684]).

Another approach to quantifying the extent of non-replicability is to model the false discovery rate—that is, the number of research results that are expected to be "false." Ioannidis (2005) developed a simulation model to do so for studies that rely on statistical hypothesis testing, incorporating the pre-study (i.e., prior) odds, the statistical tests of significance, investigator bias, and other factors. Ioannidis concluded, and used as the title of his paper, that "most published research findings are false." Some researchers have criticized Ioannidis's assumptions and mathematical argument (Goodman and Greenland, 2007); others have pointed out that the takeaway message is that any initial results that are statistically significant need further confirmation and validation.
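Ioannidis's point can be illustrated with a back-of-the-envelope calculation of the positive predictive value (PPV) of a "significant" finding. The sketch below is a minimal Python illustration, not a reproduction of his full simulation model; it assumes the simple relationship PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds that a probed effect is real, α is the significance threshold, and 1 − β is the statistical power, and all parameter values are made up for illustration.

```python
# Minimal sketch of the positive predictive value (PPV) of a "significant"
# finding, in the spirit of Ioannidis (2005). Parameter values are
# illustrative assumptions, not estimates from the paper.

def ppv(prior_odds, alpha=0.05, power=0.8):
    """Probability that a statistically significant result reflects a true effect.

    prior_odds -- pre-study odds R that a probed relationship is real
    alpha      -- significance threshold (Type I error rate)
    power      -- 1 - beta, the probability of detecting a true effect
    """
    true_positives = power * prior_odds
    false_positives = alpha  # per unit of "no effect" hypotheses tested
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # Exploratory research: only 1 in 20 probed effects is real, modest power.
    print(f"R = 0.05, power = 0.2: PPV = {ppv(0.05, power=0.2):.2f}")
    # Well-powered confirmatory research with strong prior support.
    print(f"R = 1.0,  power = 0.8: PPV = {ppv(1.0, power=0.8):.2f}")
```

Under these assumptions, a nominally significant result from low-powered exploratory work has only a modest chance of reflecting a true effect, which is the heart of the argument.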

Analyzing the distribution of published results for a particular line of inquiry can offer insights into potential bias, which can relate to the rate of non-replicability. Several tools are being developed to compare a distribution of results to what that distribution would look like if all claimed effects were representative of the true distribution of effects. Figure 5-3 shows how publication bias can result in a skewed view of the body of evidence when only positive results that meet the statistical significance threshold are reported. When a new study fails to replicate the previously published results—for example, if a study finds no relationship between variables when such a relationship had been shown in previously published studies—it appears to be a case of non-replication. However, if the published literature is not an accurate reflection of the state of the evidence because only positive results are regularly published, the new study could actually have replicated previous but unpublished negative results. 8

Several techniques are available to detect and potentially adjust for publication bias, all of which are based on the examination of a body of research as a whole (i.e., cumulative evidence), rather than on individual replication studies (i.e., one-on-one comparisons between studies). These techniques cannot determine which of the individual studies are affected by bias (i.e., which results are false positives) or identify the particular type of bias, but they arguably allow one to identify bodies of literature that are likely to be more or less accurate representations of the evidence. The techniques, discussed below, are funnel plots, p-curve, the test of excess significance, and assessing unpublished literature.

8 Earlier in this chapter, we discuss an indirect method for assessing non-replicability in which a result is compared to previously published values; results that do not agree with the published literature are identified as outliers. If the published literature is biased, this method would inappropriately reject valid results. This is another reason for investigating outliers before rejecting them.

Funnel Plots. One of the most common approaches to detecting publication bias involves constructing a funnel plot that displays each effect size against its precision (e.g., the sample size of the study). Asymmetry in the plotted values can reveal the absence of studies with small effect sizes, especially among studies with small sample sizes—a pattern that could suggest publication/selection bias for statistically significant effects (see Figure 5-3). There are criticisms of funnel plots, however; some argue that the shape of a funnel plot is largely determined by the choice of method (Tang and Liu, 2000), and others maintain that funnel plot asymmetry may not accurately reflect publication bias (Lau et al., 2006).
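A funnel plot is straightforward to construct once each study's effect estimate and a measure of its precision (for example, the standard error) are available. The sketch below is a hypothetical simulation, not an analysis of real studies: it fabricates a set of small studies of the same true effect, "publishes" only those that are large or statistically significant, and plots effect size against precision so that the resulting asymmetry is visible. The publication rule and all parameter values are assumptions made for illustration.

```python
# Hypothetical funnel-plot sketch: simulate many small studies of the same
# true effect, "publish" only those that are large or statistically
# significant, and plot effect size against precision (1 / standard error).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_effect = 0.2
sample_sizes = rng.integers(20, 400, size=300)

records = []
for n in sample_sizes:
    se = 1 / np.sqrt(n)                      # approximate standard error
    estimate = rng.normal(true_effect, se)   # observed effect in this study
    z = estimate / se
    significant = abs(z) > 1.96
    published = significant or n > 200       # crude publication-bias rule
    records.append((estimate, 1 / se, published))

est, precision, published = map(np.array, zip(*records))

plt.scatter(est[published], precision[published], label="published", alpha=0.6)
plt.scatter(est[~published], precision[~published], label="unpublished",
            alpha=0.6, marker="x")
plt.axvline(true_effect, linestyle="--", color="grey")
plt.xlabel("Effect estimate")
plt.ylabel("Precision (1 / SE)")
plt.legend()
plt.title("Asymmetry among published studies suggests publication bias")
plt.show()
```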

P -Curve. One fairly new approach is to compare the distribution of results (e.g., p- values) to the expected distributions (see Simonsohn et al., 2014a , 2014b ). P- curve analysis tests whether the distribution of statistically significant p- values shows a pronounced right-skew, 9 as would be expected when the results are true effects (i.e., the null hypothesis is false), or whether the distribution is not as right-skewed (or is even flat, or, in the most extreme cases, left-skewed), as would be expected when the original results do not reflect the proportion of real effects ( Gadbury and Allison, 2012 ; Nelson et al., 2018 ; Simonsohn et al., 2014a ).
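The intuition behind p-curve can be demonstrated by simulation. The sketch below is a minimal illustration, not the Simonsohn et al. analysis tool: it assumes a simple two-group comparison with a chosen effect size, collects the statistically significant p-values, and shows that they bunch near zero (right skew) when a true effect exists but are roughly flat when the null hypothesis is true.

```python
# Minimal p-curve intuition: under a true effect, significant p-values pile up
# near zero (right-skewed); under the null they are roughly uniform on (0, .05).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def significant_pvalues(effect_size, n_per_group=30, n_studies=5000):
    pvals = []
    for _ in range(n_studies):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            pvals.append(p)
    return np.array(pvals)

for label, d in [("null (d = 0.0)", 0.0), ("true effect (d = 0.5)", 0.5)]:
    p = significant_pvalues(d)
    share_below_025 = np.mean(p < 0.025)
    # Roughly 50% under the null (flat curve), much more under a true effect.
    print(f"{label}: {share_below_025:.0%} of significant p-values fall below .025")
```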

Test of Excess Significance. A closely related statistical idea for checking publication bias is the test of excess significance. This test evaluates whether the number of statistically significant results in a set of studies is improbably high given the size of the effect and the power to test it in the set of studies (Ioannidis and Trikalinos, 2007), which would imply that the set of results is biased and may include exaggerated results or false positives. When there is a true effect, one expects the proportion of statistically significant results to equal the statistical power of the studies. If a researcher designs her studies to have 80 percent power against a given effect, then, at most, 80 percent of her studies would produce statistically significant results if the effect is at least that large (fewer if the null hypothesis is sometimes true). Schimmack (2012) has demonstrated that the proportion of statistically significant results across a set of psychology studies often far exceeds the estimated statistical power of those studies; this pattern of results that is "too good to be true" suggests that the results were not obtained following the rules of statistical inference (i.e., conducting a single statistical test that was chosen a priori), that not all studies attempted were reported (i.e., there is a "file drawer" of statistically nonsignificant studies that do not get published), or possibly that the results were p-hacked or cherry picked (see Chapter 2).
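The core of the test is a comparison between the observed number of significant results and the number expected given the studies' power. The sketch below is a simplified, hypothetical version of that comparison using a binomial model with an assumed power; the actual test (Ioannidis and Trikalinos, 2007) estimates power from the data rather than assuming it.

```python
# Simplified excess-significance check: if each of k studies has the stated
# power, how surprising is the observed count of significant results?
from scipy.stats import binomtest

n_studies = 20
assumed_power = 0.5          # assumed probability each study detects the effect
observed_significant = 18    # e.g., 18 of 20 reported results are "significant"

result = binomtest(observed_significant, n_studies, assumed_power,
                   alternative="greater")
print(f"Expected significant results: {assumed_power * n_studies:.1f}")
print(f"Observed: {observed_significant}")
print(f"P(this many or more | power = {assumed_power}): {result.pvalue:.4f}")
# A very small probability suggests the set of results is "too good to be true".
```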

In many fields, the proportion of published papers that report a positive (i.e., statistically significant) result is around 90 percent (Fanelli, 2012). This raises concerns when combined with the observation that most studies have far less than 90 percent statistical power (i.e., they would successfully detect an effect, assuming an effect exists, far less than 90 percent of the time) (Button et al., 2013; Fraley and Vazire, 2014; Szucs and Ioannidis, 2017; Yarkoni, 2009; Stanley et al., 2018). Some researchers believe that the publication of false positives is common and that reforms are needed to reduce it. Others believe that there has been an excessive focus on Type I errors (i.e., false positives) in hypothesis testing at the possible expense of an increase in Type II errors (i.e., false negatives, or failing to confirm true hypotheses) (Fiedler et al., 2012; Finkel et al., 2015; LeBel et al., 2017).

9 Distributions that have more low p-values than high ones are referred to as "right-skewed"; similarly, "left-skewed" distributions have more high p-values than low ones.
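The gap between a roughly 90 percent positive-result rate and typical statistical power can be made concrete with a power calculation. The sketch below computes the power of a two-sided, two-sample t-test from the noncentral t distribution; the effect size and sample sizes are illustrative assumptions.

```python
# Power of a two-sample t-test for an assumed standardized effect size d
# and per-group sample size n, using the noncentral t distribution.
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # Probability that |t| exceeds the critical value under the alternative
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for n in (20, 50, 200):
    print(f"n = {n:>3} per group, d = 0.3: power = {two_sample_power(0.3, n):.2f}")
# Typical small-to-medium samples fall far short of 90 percent power.
```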

Assessing Unpublished Literature. One approach to countering publication bias is to search for and include unpublished papers and results when conducting a systematic review of the literature. Such comprehensive searches are not standard practice. For medical reviews, one estimate is that only 6 percent of reviews included unpublished work ( Hartling et al., 2017 ), although another found that 50 percent of reviews did so ( Ziai et al., 2017 ). In economics, there is a large and active group of researchers collecting and sharing “grey” literature, research results outside of peer reviewed publications ( Vilhuber, 2018 ). In psychology, an estimated 75 percent of reviews included unpublished research ( Rothstein, 2006 ). Unpublished but recorded studies (such as dissertation abstracts, conference programs, and research aggregation websites) may become easier for reviewers to access with computerized databases and with the availability of preprint servers. When a review includes unpublished studies, researchers can directly compare their results with those from the published literature, thereby estimating file-drawer effects.

Misaligned Incentives

Academic incentives—such as tenure, grant money, and status—may influence scientists to compromise on good research practices ( Freeman, 2018 ). Faculty hiring, promotion, and tenure decisions are often based in large part on the “productivity” of a researcher, such as the number of publications, number of citations, and amount of grant money received ( Edwards and Roy, 2017 ). Some have suggested that these incentives can lead researchers to ignore standards of scientific conduct, rush to publish, and overemphasize positive results ( Edwards and Roy, 2017 ). Formal models have shown how these incentives can lead to high rates of non-replicable results ( Smaldino and McElreath, 2016 ). Many of these incentives may be well intentioned, but they could have the unintended consequence of reducing the quality of the science produced, and poorer quality science is less likely to be replicable.

Although it is difficult to assess how widespread these unhelpful sources of non-replicability are, factors such as publication bias toward results qualifying as "statistically significant" and misaligned incentives for academic scientists create conditions that favor the publication of non-replicable results and inferences.

Inappropriate Statistical Inference

Confirmatory research is research that starts with a well-defined research question and a priori hypotheses before collecting data; confirmatory research can also be called hypothesis-testing research. In contrast, researchers pursuing exploratory research collect data and then examine the data for potential variables of interest and relationships among variables, forming a posteriori hypotheses; as such, exploratory research can be considered hypothesis-generating research. Exploratory and confirmatory analyses are often described as two different stages of the research process. Some have distinguished between the "context of discovery" and the "context of justification" (Reichenbach, 1938), while others have argued that the distinction lies on a spectrum rather than being categorical. Regardless of the precise line between exploratory and confirmatory research, researchers' choices between the two affect how they and others interpret the results.

A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis ( de Groot, 2014 ). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised. Thus, it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate statistically significant result.

Researchers often learn from their data, and some of the most important discoveries in the annals of science have come from unexpected results that did not fit any prior theory. For example, Arno Allan Penzias and Robert Woodrow Wilson found unexpected noise in data collected in the course of their work on microwave receivers for radio astronomy observations. After attempts to explain the noise failed, the “noise” was eventually determined to be cosmic microwave background radiation, and these results helped scientists to refine and confirm theories about the “big bang.” While exploratory research generates new hypotheses, confirmatory research is equally important because it tests the hypotheses generated and can give valid answers as to whether these hypotheses have any merit. Exploratory and confirmatory research are essential parts of science, but they need to be understood and communicated as two separate types of inquiry, with two different interpretations.

A well-conducted exploratory analysis can help illuminate possible hypotheses to be examined in subsequent confirmatory analyses. Even a stark result in an exploratory analysis has to be interpreted cautiously, pending further work to test the hypothesis using a new or expanded dataset. It is often unclear from publications whether the results came from an exploratory or a confirmatory analysis. This lack of clarity can misrepresent the reliability and broad applicability of the reported results.

In Chapter 2 , we discussed the meaning, overreliance, and frequent misunderstanding of statistical significance, including misinterpreting the meaning and overstating the utility of a particular threshold, such as p < 0.05. More generally, a number of flaws in design and reporting can reduce the reliability of a study’s results.

Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives ( John et al., 2012 ; Munafo et al., 2017 ). A study from the late-1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most ( Peto, 2011 ). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified.
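The hazard of unplanned subgroup analyses is easy to reproduce by simulation. In the hypothetical sketch below, a "treatment" with no effect at all is tested within twelve arbitrary subgroups, standing in for the astrological signs; with a dozen looks at the same null data, there is roughly a 50:50 chance that at least one subgroup comes out "significant" by chance.

```python
# Hypothetical illustration: post hoc subgroup tests on pure noise.
# With 12 subgroups and alpha = 0.05, spurious "findings" are likely.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_patients = 2400
outcome = rng.normal(size=n_patients)              # no true treatment effect
treated = rng.integers(0, 2, size=n_patients)      # random assignment
subgroup = rng.integers(0, 12, size=n_patients)    # e.g., "astrological sign"

for g in range(12):
    mask = subgroup == g
    p = stats.ttest_ind(outcome[mask & (treated == 1)],
                        outcome[mask & (treated == 0)]).pvalue
    flag = "  <-- spuriously 'significant'" if p < 0.05 else ""
    print(f"subgroup {g:2d}: p = {p:.3f}{flag}")
```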

Little information is available about the prevalence of such inappropriate statistical practices as p- hacking, cherry picking, and hypothesizing after results are known (HARKing), discussed below. While surveys of researchers raise the issue—often using convenience samples—methodological shortcomings mean that they are not necessarily a reliable source for a quantitative assessment. 10

P-hacking and Cherry Picking. P-hacking is the practice of collecting, selecting, or analyzing data until a result of statistical significance is found. Different ways to p-hack include stopping data collection once p ≤ 0.05 is reached, analyzing many different relationships and only reporting those for which p ≤ 0.05, varying the exclusion and inclusion rules for data so that p ≤ 0.05, and analyzing different subgroups in order to get p ≤ 0.05. Researchers may p-hack without knowing or without understanding the consequences (Head et al., 2015). This is related to the practice of cherry picking, in which researchers may (unconsciously or deliberately) pick through their data and results and selectively report those that meet criteria such as meeting a threshold of statistical significance or supporting a positive result, rather than reporting all of the results from their research.

10 For an example of one study of this issue, see Fraser et al. (2018).
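One form of p-hacking, stopping data collection as soon as p ≤ 0.05, can be simulated directly. The hypothetical sketch below repeatedly "runs" a study with no true effect, checking the p-value after every batch of observations and stopping at the first nominally significant result; the false positive rate ends up well above the nominal 5 percent.

```python
# Hypothetical illustration of optional stopping: peek at the p-value as data
# accumulate and stop as soon as p < 0.05. With no true effect, the false
# positive rate climbs well above the nominal 5 percent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def study_with_peeking(max_n=100, peek_every=10):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(size=peek_every))   # control group, no effect
        b.extend(rng.normal(size=peek_every))   # "treatment" group, no effect
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True                          # stop and declare "significance"
    return False

n_sim = 2000
false_positives = sum(study_with_peeking() for _ in range(n_sim))
print(f"False positive rate with optional stopping: {false_positives / n_sim:.1%}")
# Compare with the ~5% expected from a single pre-planned test.
```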

HARKing. Confirmatory research begins with identifying a hypothesis based on observations, exploratory analysis, or building on previous research. Data are collected and analyzed to see if they support the hypothesis. HARKing applies to confirmatory research that incorrectly bases the hypothesis on the data collected and then uses that same data as evidence to support the hypothesis. It is unknown to what extent inappropriate HARKing occurs in various disciplines, but some have attempted to quantify the consequences of HARKing. For example, a 2015 article compared hypothesized effect sizes against non-hypothesized effect sizes and found that effects were significantly larger when the relationships had been hypothesized, a finding consistent with the presence of HARKing ( Bosco et al., 2015 ).

Poor Study Design

Before conducting an experiment, a researcher must make a number of decisions about study design. These decisions—which vary depending on type of study—could include the research question, the hypotheses, the variables to be studied, avoiding potential sources of bias, and the methods for collecting, classifying, and analyzing data. Researchers’ decisions at various points along this path can contribute to non-replicability. Poor study design can include not recognizing or adjusting for known biases, not following best practices in terms of randomization, poorly designing materials and tools (ranging from physical equipment to questionnaires to biological reagents), confounding in data manipulation, using poor measures, or failing to characterize and account for known uncertainties.

In 2010, economists Carmen Reinhart and Kenneth Rogoff published an article that appeared to show that if a country's debt exceeds 90 percent of the country's gross domestic product, economic growth slows and declines slightly (by 0.1 percent). These results were widely publicized and used to support austerity measures around the world (Herndon et al., 2013). However, in 2013, with access to Reinhart and Rogoff's original spreadsheet of data and analysis (which the authors had saved and made available for the replication effort), researchers reanalyzing the original studies found several errors in the analysis and data selection. One error was an incomplete set of countries used in the analysis that established the relationship between debt and economic growth. When data from Australia, Austria, Belgium, Canada, and Denmark were correctly included, and other errors were corrected, the economic growth in the countries with debt above 90 percent of gross domestic product was actually +2.2 percent, rather than −0.1 percent. In response, Reinhart and Rogoff acknowledged the errors, calling it "sobering that such an error slipped into one of our papers despite our best efforts to be consistently careful." Reinhart and Rogoff said that while the error led to a "notable change" in the calculation of growth in one category, they did not believe it "affects in any significant way the central message of the paper." 11

The Reinhart and Rogoff error was fairly high profile and a quick Internet search would let any interested reader know that the original paper contained errors. Many errors could go undetected or are only acknowledged through a brief correction in the publishing journal. A 2015 study looked at a sample of more than 250,000 p- values reported in eight major psychology journals over a period of 28 years. The study found that many of the p- values reported in papers were inconsistent with a recalculation of the p- value and that in one out of eight papers, this inconsistency was large enough to affect the statistical conclusion ( Nuijten et al., 2016 ).
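The kind of inconsistency that Nuijten et al. detected can be checked mechanically, because a reported test statistic and its degrees of freedom determine the p-value. The sketch below is a minimal, hypothetical version of such a check for a single two-sided t-test; tools built for this purpose apply the same logic at scale across many test types.

```python
# Minimal consistency check: does a reported p-value match the p-value implied
# by the reported test statistic and degrees of freedom? (Two-sided t-test.)
from scipy import stats

def check_reported_t(t_value, df, reported_p, tolerance=0.005):
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    consistent = abs(recomputed - reported_p) <= tolerance
    return recomputed, consistent

# Hypothetical reported result, deliberately inconsistent: "t(28) = 2.20, p = .02"
# (the recomputed two-sided p is roughly .04).
recomputed, ok = check_reported_t(2.20, 28, 0.02)
print(f"Recomputed p = {recomputed:.3f}; consistent with report: {ok}")
```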

Errors can occur at any point in the research process: measurements can be recorded inaccurately, typographical errors can occur when inputting data, and calculations can contain mistakes. If these errors affect the final results and are not caught prior to publication, the research may be non-replicable. Unfortunately, these types of errors can be difficult to detect. In the case of computational errors, transparency in data and computation may make it more likely that the errors can be caught and corrected. For other errors, such as mistakes in measurement, errors might not be detected until and unless a failed replication that does not make the same mistake indicates that something was amiss in the original study. Errors may also be made by researchers despite their best intentions (see Box 5-2 ).

Incomplete Reporting of a Study

During the course of research, researchers make numerous choices about their studies. When a study is published, some of these choices are reported in the methods section. A methods section often covers what materials were used, how participants or samples were chosen, what data collection procedures were followed, and how data were analyzed. The failure to report some aspect of the study—or to do so in sufficient detail—may make it difficult for another researcher to replicate the result. For example, if a researcher only reports that she "adjusted for comorbidities" within the study population, this does not provide sufficient information about how exactly the comorbidities were adjusted for, and it does not give enough guidance for future researchers to follow the protocol. Similarly, if a researcher does not give adequate information about the biological reagents used in an experiment, a second researcher may have difficulty replicating the experiment. Even if a researcher reports all of the critical information about the conduct of a study, other seemingly inconsequential details that have an effect on the outcome could remain unreported.

11 See https://archive.nytimes.com/www.nytimes.com/interactive/2013/04/17/business/17economixresponse.html.

Just as reproducibility requires transparent sharing of data, code, and analysis, replicability requires transparent sharing of how an experiment was conducted and the choices that were made. This allows future researchers, if they wish, to attempt replication as close to the original conditions as possible.

Fraud and Misconduct

At the extreme, sources of non-replicability that do not advance scientific knowledge—and do much to harm science—include misconduct and fraud in scientific research. Instances of fraud are uncommon but can be sensational. Despite fraud's infrequent occurrence, and regardless of how highly publicized cases may be, the fact that it is uniformly bad for science means that it is worthy of attention within this study.

Researchers who knowingly use questionable research practices with the intent to deceive are committing misconduct or fraud. It can be difficult in practice to differentiate between honest mistakes and deliberate misconduct because the underlying action may be the same while the intent is not.

Reproducibility and replicability emerged as general concerns in science around the same time as research misconduct and detrimental research practices were receiving renewed attention. Interest in both reproducibility and replicability as well as misconduct was spurred by some of the same trends and a small number of widely publicized cases in which discovery of fabricated or falsified data was delayed, and the practices of journals, research institutions, and individual labs were implicated in enabling such delays ( National Academies of Sciences, Engineering, and Medicine, 2017 ; Levelt Committee et al., 2012 ).

In the case of Anil Potti at Duke University, a researcher using genomic analysis on cancer patients was later found to have falsified data. This experience prompted the study and the report, Evolution of Translational Omics: Lessons Learned and the Way Forward (Institute of Medicine, 2012), which in turn led to new guidelines for omics research at the National Cancer Institute. Around the same time, in a case that came to light in the Netherlands, social psychologist Diederik Stapel had gone from manipulating to fabricating data over the course of a career with dozens of fraudulent publications. Similarly, highly publicized concerns about misconduct by Cornell University professor Brian Wansink highlight how consistent failure to adhere to best practices for collecting, analyzing, and reporting data—intentional or not—can blur the line between helpful and unhelpful sources of non-replicability. In this case, a Cornell faculty committee attributed to Wansink "academic misconduct in his research and scholarship, including misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship." 12

A subsequent report, Fostering Integrity in Research ( National Academies of Sciences, Engineering, and Medicine, 2017 ), emerged in this context, and several of its central themes are relevant to questions posed in this report.

According to the definition adopted by the U.S. federal government in 2000, research misconduct is fabrication of data, falsification of data, or plagiarism "in proposing, performing, or reviewing research, or in reporting research results" (Office of Science and Technology Policy, 2000, p. 76262). The federal policy requires that research institutions report all allegations of misconduct in research projects supported by federal funding that have advanced from the inquiry stage to a full investigation, and that they report on the results of those investigations.

12 See http://statements.cornell.edu/2018/20180920-statement-provost-michael-kotlikoff.cfm.

Other detrimental research practices (see National Academies of Sciences, Engineering, and Medicine, 2017) include failing to follow sponsor requirements or disciplinary standards for retaining data, authorship misrepresentation other than plagiarism, refusing to share data or methods, and misleading statistical analysis that falls short of falsification. In addition to the behaviors of individual researchers, detrimental research practices also include actions taken by organizations, such as failure on the part of research institutions to maintain adequate policies, procedures, or capacity to foster research integrity and assess allegations of research misconduct, and abusive or irresponsible publication practices by journal editors and peer reviewers.

Just as information on rates of non-reproducibility and non-replicability in research is limited, knowledge about research misconduct and detrimental research practices is scarce. Reports of research misconduct allegations and findings are released by the National Science Foundation Office of Inspector General and the Department of Health and Human Services Office of Research Integrity (see National Science Foundation, 2018d ). As discussed above, new analyses of retraction trends have shed some light on the frequency of occurrence of fraud and misconduct. Allegations and findings of misconduct increased from the mid-2000s to the mid-2010s but may have leveled off in the past few years.

Analysis of retractions of scientific articles in journals may also shed some light on the problem ( Steen et al., 2013 ). One analysis of biomedical articles found that misconduct was responsible for more than two-thirds of retractions ( Fang et al., 2012 ). As mentioned earlier, a wider analysis of all retractions of scientific papers found about one-half attributable to misconduct or fraud ( Brainard, 2018 ). Others have found some differences according to discipline ( Grieneisen and Zhang, 2012 ).

One theme of Fostering Integrity in Research is that research misconduct and detrimental research practices are a continuum of behaviors ( National Academies of Sciences, Engineering, and Medicine, 2017 ). While current policies and institutions aimed at preventing and dealing with research misconduct are certainly necessary, detrimental research practices likely arise from some of the same causes and may cost the research enterprise more than misconduct does in terms of resources wasted on the fabricated or falsified work, resources wasted on following up this work, harm to public health due to treatments based on acceptance of incorrect clinical results, reputational harm to collaborators and institutions, and others.

No branch of science is immune to research misconduct, and the committee did not find any basis to differentiate the relative level of occurrence in various branches of science. Some but not all researcher misconduct has been uncovered through reproducibility and replication attempts, which are the self-correcting mechanisms of science. From the available evidence, documented cases of researcher misconduct are relatively rare, as suggested by a rate of retractions in scientific papers of approximately 4 in 10,000 (Brainard, 2018).

CONCLUSION 5-4: The occurrence of non-replicability is due to multiple sources, some of which impede and others of which promote progress in science. The overall extent of non-replicability is an inadequate indicator of the health of science.


One of the pathways by which the scientific community confirms the validity of a new scientific discovery is by repeating the research that produced it. When a scientific effort fails to independently confirm the computations or results of a previous study, some fear that it may be a symptom of a lack of rigor in science, while others argue that such an observed inconsistency can be an important precursor to new discovery.

Concerns about reproducibility and replicability have been expressed in both scientific and popular media. As these concerns came to light, Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct a study to assess the extent of issues related to reproducibility and replicability and to offer recommendations for improving rigor and transparency in scientific research.

Reproducibility and Replicability in Science defines reproducibility and replicability and examines the factors that may lead to non-reproducibility and non-replicability in research. Unlike the typical expectation of reproducibility between two computations, expectations about replicability are more nuanced, and in some cases a lack of replicability can aid the process of scientific discovery. This report provides recommendations to researchers, academic institutions, journals, and funders on steps they can take to improve reproducibility and replicability in science.



Replicating scientific results is tough — but essential


[Image: A researcher works in a lab at the Cancer Research Center of Marseille. Funders and publishers need to take replication studies much more seriously than they do at present. Credit: Anne-Christine Poujoulat/AFP/Getty]

Replicability — the ability to obtain the same result when an experiment is repeated — is foundational to science. But in many research fields it has proved difficult to achieve. An important and much-anticipated brace of research papers now show just how complicated, time-consuming and difficult it can be to conduct and interpret replication studies in cancer biology 1, 2.

Nearly a decade ago, research teams organized by the non-profit Center for Open Science in Charlottesville, Virginia, and ScienceExchange, a research-services company based in Palo Alto, California, set out to systematically test whether selected experiments in highly cited papers published in prestigious scientific journals could be replicated. The effort was part of the high-profile Reproducibility Project: Cancer Biology (RPCB) initiative. The researchers assessed experimental outcomes or ‘effects’ by seven metrics, five of which could apply to numerical results. Overall, 46% of these replications were successful by three or more of these metrics, such as whether results fell within the confidence interval predicted by the experiment or retained statistical significance.
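One of the simpler metrics of this kind asks whether the replication's effect estimate falls inside the original study's confidence interval and whether it remains statistically significant in the same direction. The sketch below is a hypothetical illustration of that style of check, not the RPCB's actual scoring code, and every number in it is made up.

```python
# Hypothetical check of two simple replication metrics for an effect estimate:
# (1) does the replication estimate fall inside the original 95% CI?
# (2) is the replication significant in the same direction as the original?
from dataclasses import dataclass

@dataclass
class EffectEstimate:
    estimate: float   # e.g., log hazard ratio or standardized mean difference
    ci_low: float     # lower bound of the 95% confidence interval
    ci_high: float    # upper bound of the 95% confidence interval
    p_value: float

original = EffectEstimate(estimate=0.80, ci_low=0.45, ci_high=1.15, p_value=0.001)
replication = EffectEstimate(estimate=0.12, ci_low=-0.10, ci_high=0.34, p_value=0.28)

within_original_ci = original.ci_low <= replication.estimate <= original.ci_high
same_direction_significant = (replication.p_value < 0.05
                              and replication.estimate * original.estimate > 0)

print(f"Replication estimate inside original CI: {within_original_ci}")
print(f"Significant in the same direction:       {same_direction_significant}")
```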

The project was launched in the wake of reports from drug companies that they could not replicate findings in many cancer-biology papers. But those reports did not identify the papers, nor the criteria for replication. The RPCB was conceived to bring research rigour to such retrospective replication studies.

Initial findings

One of the clearest findings was that the effects of an experimental treatment — such as killing cancer cells or shrinking tumours — were drastically smaller in the replications, 85% smaller overall, than what had been reported originally. It's hard to know why. There could have been a statistical fluke, for example; bias in the original study or in the replication; or a lack of know-how by the replicators that caused the repeated study to miss some essential quality of the original.


The project also took more than five years longer than expected, and, despite the extra time, the teams were able to assess only one-quarter of the experiments they had originally planned to cover. This underscores the fact that such assessments take much more time and effort than expected.

The RPCB studies were budgeted to cost US$1.3 million over three years. That was increased to $1.5 million, not including the costs of personnel or project administration.

None of the 53 papers selected contained enough detail for the researchers to repeat the experiments. So the replicators had to contact authors for information, such as how many cells were injected, by what route, or the exact reagent used. Often, these were details that even the authors could not provide because the information had not been recorded or laboratory members had moved on. And one-third of authors either refused requests for more information or did not respond. For 136 of the 193 experimental effects assessed, replicators also had to request a key reagent from the original authors (such as a cell line, plasmid or model organism) because they could not buy it or get it from a repository. Some 69% of the authors were willing to share their reagents.

Openness and precision

Since the reproducibility project began, several efforts have encouraged authors to share more-precise methodological details of their studies. Nature , along with other journals, introduced a reproducibility checklist in 2013. It requires that authors report key experimental data, such as the strain, age and sex of animals used. Authors are also encouraged to deposit their experimental protocols in repositories, so that other researchers can access them.


Furthermore, the ‘Landis 4’ criteria were published in 2012 to promote rigorous animal research. They include the requirement for blinding, randomization and statistically assessed sample sizes. Registered Reports, an article format in which researchers publish the design of their studies before doing their experiments, is another key development. It means that ‘null effects’ are more likely to be published than buried in a file drawer . The project team found that null effects were more likely to be replicated; 80% of such studies passed by three metrics, compared with only 40% of ‘positive effects’.

Harder to resolve is the fact that what works in one lab might not work in another, possibly because of inherent variation or unrecognized methodological differences. Take the following example: one study tracked whether a certain type of cell contributes to blood supply in tumours 3 . Tracking these cells required that they express a ‘reporter’ molecule (in this case, green fluorescent protein). But, despite many attempts and tweaks, the replicating team couldn’t make the reporter sufficiently active in the cells to be tracked 4 , so the replication attempt was stopped.

The RPCB teams vetted replication protocols with the original authors, and also had them peer reviewed. But detailed advance agreement on experimental designs will not necessarily, on its own, account for setbacks encountered when studies are repeated — in some cases, many years after the originals. That is why another approach to replication is used by the US Defense Advanced Research Projects Agency (DARPA). In one DARPA programme, research teams are assigned independent verification teams. The research teams must help to troubleshoot and provide support for the verification teams so that key results can be obtained in another lab even before work is published. This approach is built into programme requirements: 3–8% of funds allocated for research programmes go towards such verification efforts 5 .

Such studies also show that researchers, research funders and publishers must take replication studies much more seriously. Researchers need to engage in such actions, funders must ramp up investments in these studies, and publishers, too, must play their part so that researchers can be confident that this work is important. It is laudable that the press conference announcing the project’s results included remarks and praise by the leaders of the US National Academies of Sciences, Engineering, and Medicine and the National Institutes of Health. But the project was funded by a philanthropic investment fund, Arnold Ventures in Houston, Texas.

The entire scientific community must recognize that replication is not for replication’s sake, but to gain an assurance central to the progress of science: that an observation or result is sturdy enough to spur future work. The next wave of replication efforts should be aimed at making this everyday essential easier to achieve.

Nature 600, 359–360 (2021)

doi: https://doi.org/10.1038/d41586-021-03736-4

Updates & Corrections

Correction 16 December 2021 : This article originally mischaracterized the RPCB’s analysis of replication attempts. Rather than recording seven experimental outcomes, it assessed experimental effects using seven metrics, and it also assessed 193 experimental effects not 193 experiments.

References

1. Errington, T. M., Denis, A., Perfito, N., Iorns, E. & Nosek, B. A. eLife 10, e67995 (2021).
2. Errington, T. M. et al. eLife 10, e71601 (2021).
3. Ricci-Vitiani, L. et al. Nature 468, 824–828 (2010).
4. Errington, T. M. et al. eLife 10, e73430 (2021).
5. Raphael, M. P., Sheehan, P. E. & Vora, G. J. Nature 579, 190–192 (2020).


Experimental Design: Types, Examples & Methods

By Saul McLeod, PhD, and Olivia Guy-Evans, MSc (Simply Psychology)

Experimental design refers to how participants are allocated to different groups in an experiment. Types of design include repeated measures, independent groups, and matched pairs designs.

Probably the most common way to design an experiment in psychology is to divide the participants into two groups, the experimental group and the control group, and then introduce a change to the experimental group but not the control group.

The researcher must decide how he/she will allocate their sample to the different experimental groups.  For example, if there are 10 participants, will all 10 participants participate in both groups (e.g., repeated measures), or will the participants be split in half and take part in only one group each?

Three types of experimental designs are commonly used:

1. Independent Measures

Independent measures design, also known as between-groups , is an experimental design where different participants are used in each condition of the independent variable.  This means that each condition of the experiment includes a different group of participants.

This should be done by random allocation, which ensures that each participant has an equal chance of being assigned to either group.

Independent measures involve using two separate groups of participants, one in each condition.

  • Con : More people are needed than with the repeated measures design (i.e., more time-consuming).
  • Pro : Avoids order effects (such as practice or fatigue) as people participate in one condition only.  If a person is involved in several conditions, they may become bored, tired, and fed up by the time they come to the second condition or become wise to the requirements of the experiment!
  • Con : Differences between participants in the groups may affect results, for example, variations in age, gender, or social background.  These differences are known as participant variables (i.e., a type of extraneous variable ).
  • Control : After the participants have been recruited, they should be randomly assigned to their groups. This should ensure the groups are similar, on average (reducing participant variables).
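Random allocation for an independent measures design is simple to script. The sketch below is a minimal example with made-up participant IDs: shuffling the list before splitting it gives every participant an equal chance of ending up in either condition.

```python
# Minimal random allocation for an independent measures (between-groups) design:
# shuffle the participant list, then split it into two equal-sized conditions.
import random

participants = [f"P{i:02d}" for i in range(1, 21)]   # 20 hypothetical participants

random.seed(2024)           # fixed seed only so the example is reproducible
random.shuffle(participants)

half = len(participants) // 2
experimental_group = participants[:half]
control_group = participants[half:]

print("Experimental:", experimental_group)
print("Control:     ", control_group)
```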

2. Repeated Measures Design

Repeated Measures design is an experimental design where the same participants participate in each independent variable condition.  This means that each experiment condition includes the same group of participants.

Repeated Measures design is also known as within-groups or within-subjects design .

  • Pro : As the same participants are used in each condition, participant variables (i.e., individual differences) are reduced.
  • Con : There may be order effects. Order effects refer to the order of the conditions affecting the participants’ behavior.  Performance in the second condition may be better because the participants know what to do (i.e., practice effect).  Or their performance might be worse in the second condition because they are tired (i.e., fatigue effect). This limitation can be controlled using counterbalancing.
  • Pro : Fewer people are needed as they participate in all conditions (i.e., saves time).
  • Control : To combat order effects, the researcher counterbalances the order of the conditions for the participants, alternating the order in which participants complete the different conditions of the experiment.

Counterbalancing

Suppose we used a repeated measures design in which all of the participants first learned words in “loud noise” and then learned them in “no noise.”

We expect the participants to learn better in “no noise” because of order effects, such as practice. However, a researcher can control for order effects using counterbalancing.

The sample would be split into two halves, and the two conditions labelled A ("loud noise") and B ("no noise"). Group 1 does A then B, and group 2 does B then A. This is done to cancel out order effects.

Although order effects occur for each participant, they balance each other out in the results because they occur equally in both groups.

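Counterbalancing can be scripted in much the same way: randomly split the sample in half and give each half the two conditions in opposite orders (an AB/BA design). The sketch below is a minimal illustration with made-up participant IDs.

```python
# Minimal AB/BA counterbalancing for a repeated measures design:
# half the sample completes condition A then B, the other half B then A.
import random

participants = [f"P{i:02d}" for i in range(1, 13)]   # 12 hypothetical participants
random.seed(7)
random.shuffle(participants)

half = len(participants) // 2
orders = {p: ("A", "B") for p in participants[:half]}
orders.update({p: ("B", "A") for p in participants[half:]})

for p in sorted(orders):
    first, second = orders[p]
    print(f"{p}: {first} then {second}")
```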

3. Matched Pairs Design

A matched pairs design is an experimental design where pairs of participants are matched in terms of key variables, such as age or socioeconomic status. One member of each pair is then placed into the experimental group and the other member into the control group .

One member of each matched pair must be randomly assigned to the experimental group and the other to the control group.


  • Con : If one participant drops out, you lose 2 PPs’ data.
  • Pro : Reduces participant variables because the researcher has tried to pair up the participants so that each condition has people with similar abilities and characteristics.
  • Con : Very time-consuming trying to find closely matched pairs.
  • Pro : It avoids order effects, so counterbalancing is not necessary.
  • Con : Impossible to match people exactly unless they are identical twins!
  • Control : Members of each pair should be randomly assigned to conditions. However, this does not solve all these problems.
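A matched pairs allocation can also be scripted: rank participants on the matching variable, pair adjacent participants, and then randomly assign one member of each pair to each condition. The sketch below assumes a single made-up matching variable (a pre-test score); real designs often match on several characteristics at once.

```python
# Minimal matched pairs allocation: pair participants with adjacent pre-test
# scores, then randomly place one member of each pair in each condition.
import random

random.seed(11)
# Hypothetical participants with a matching variable (e.g., a pre-test score).
participants = {f"P{i:02d}": random.randint(60, 140) for i in range(1, 13)}

ranked = sorted(participants, key=participants.get)   # sort by score
experimental, control = [], []
for i in range(0, len(ranked), 2):
    pair = [ranked[i], ranked[i + 1]]
    random.shuffle(pair)                               # random assignment within pair
    experimental.append(pair[0])
    control.append(pair[1])

print("Experimental:", experimental)
print("Control:     ", control)
```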

Experimental design refers to how participants are allocated to an experiment’s different conditions (or IV levels). There are three types:

1. Independent measures / between-groups : Different participants are used in each condition of the independent variable.

2. Repeated measures /within groups : The same participants take part in each condition of the independent variable.

3. Matched pairs : Each condition uses different participants, but they are matched in terms of important characteristics, e.g., gender, age, intelligence, etc.

Learning Check

Read about each of the experiments below. For each experiment, identify (1) which experimental design was used; and (2) why the researcher might have used that design.

1 . To compare the effectiveness of two different types of therapy for depression, depressed patients were assigned to receive either cognitive therapy or behavior therapy for a 12-week period.

The researchers attempted to ensure that the patients in the two groups had similar severity of depressed symptoms by administering a standardized test of depression to each participant, then pairing them according to the severity of their symptoms.

2 . To assess the difference in reading comprehension between 7 and 9-year-olds, a researcher recruited each group from a local primary school. They were given the same passage of text to read and then asked a series of questions to assess their understanding.

3 . To assess the effectiveness of two different ways of teaching reading, a group of 5-year-olds was recruited from a primary school. Their level of reading ability was assessed, and then they were taught using scheme one for 20 weeks.

At the end of this period, their reading was reassessed, and a reading improvement score was calculated. They were then taught using scheme two for a further 20 weeks, and another reading improvement score for this period was calculated. The reading improvement scores for each child were then compared.

4 . To assess the effect of the organization on recall, a researcher randomly assigned student volunteers to two conditions.

Condition one attempted to recall a list of words that were organized into meaningful categories; condition two attempted to recall the same words, randomly grouped on the page.

Experiment Terminology

Ecological validity.

The degree to which an investigation represents real-life experiences.

Experimenter effects

These are the ways that the experimenter can accidentally influence the participant through their appearance or behavior.

Demand characteristics

Clues in an experiment that lead the participants to think they know what the researcher is looking for (e.g., the experimenter's body language).

Independent variable (IV)

The variable the experimenter manipulates (i.e., changes), which is assumed to have a direct effect on the dependent variable.

Dependent variable (DV)

Variable the experimenter measures. This is the outcome (i.e., the result) of a study.

Extraneous variables (EV)

All variables which are not independent variables but could affect the results (DV) of the experiment. Extraneous variables should be controlled where possible.

Confounding variables

Variable(s) that have affected the results (DV), apart from the IV. A confounding variable could be an extraneous variable that has not been controlled.

Random Allocation

Randomly allocating participants to independent variable conditions means that all participants should have an equal chance of taking part in each condition.

The principle of random allocation is to avoid bias in how the experiment is carried out and limit the effects of participant variables.

Order effects

Changes in participants’ performance due to their repeating the same or similar test more than once. Examples of order effects include:

(i) practice effect: an improvement in performance on a task due to repetition, for example, because of familiarity with the task;

(ii) fatigue effect: a decrease in performance of a task due to repetition, for example, because of boredom or tiredness.


Why is Replication in Research Important?

Replication in research is important because it allows for the verification and validation of study findings, building confidence in their reliability and generalizability. It also fosters scientific progress by promoting the discovery of new evidence, expanding understanding, and challenging existing theories or claims.

Updated on June 30, 2023


Often viewed as a cornerstone of science , replication builds confidence in the scientific merit of a study’s results. The philosopher Karl Popper argued that, “we do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them.”

As such, creating the potential for replication is a common goal for researchers. The methods section of scientific manuscripts is vital to this process as it details exactly how the study was conducted. From this information, other researchers can replicate the study and evaluate its quality.

This article discusses replication as a rational concept integral to the philosophy of science and as a process validating the continuous loop of the scientific method. By considering both the ethical and practical implications, we may better understand why replication is important in research.

What is replication in research?

As a fundamental tool for building confidence in the value of a study’s results, replication has power. Some would say it has the power to make or break a scientific claim when, in reality, it is simply part of the scientific process, neither good nor bad.

When Nosek and Errington propose that replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research, they revive its neutrality. The true purpose of replication, therefore, is to advance scientific discovery and theory by introducing new evidence that broadens the current understanding of a given question.

Why is replication important in research?

The great philosopher and scientist, Aristotle , asserted that a science is possible if and only if there are knowable objects involved. There cannot be a science of unicorns, for example, because unicorns do not exist. Therefore, a ‘science’ of unicorns lacks knowable objects and is not a ‘science’.

This philosophical foundation of science perfectly illustrates why replication is important in research. Basically, when an outcome is not replicable, it is not knowable and does not truly exist. This means that each time replication of a study or a result is possible, its credibility and validity expand.

The lack of replicability is just as vital to the scientific process. It pushes researchers in new and creative directions, compelling them to continue asking questions and to never become complacent. Replication is as much a part of the scientific method as formulating a hypothesis or making observations.

Types of replication

Historically, replication has been divided into two broad categories: 

  • Direct replication : performing a new study that follows a previous study’s original methods and then comparing the results. While direct replication follows the protocols from the original study, the samples and conditions, time of day or year, lab space, research team, etc. are necessarily different. In this way, a direct replication uses empirical testing to reflect the prevailing beliefs about what is needed to produce a particular finding.
  • Conceptual replication : performing a study that employs different methodologies to test the same hypothesis as an existing study. By applying diverse manipulations and measures, conceptual replication aims to operationalize a study’s underlying theoretical variables. In doing so, conceptual replication promotes collaborative research and explanations that are not based on a single methodology.

Though these general divisions provide a helpful starting point for both conducting and understanding replication studies, they are not polar opposites. There are nuances that produce countless subcategories such as:

  • Internal replication : when the same research team conducts the same study while taking negative and positive factors into account
  • Microreplication : conducting partial replications of the findings of other research groups
  • Constructive replication : both manipulations and measures are varied
  • Participant replication : changes only the participants

Many researchers agree these labels should be confined to study design, as direction for the research team, not a preconceived notion. In fact, Nosek and Errington conclude that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge.

How do researchers replicate a study?

Like all research studies, replication studies require careful planning. The Open Science Framework (OSF) offers a practical guide which details the following steps:

  • Identify a study that is feasible to replicate given the time, expertise, and resources available to the research team.
  • Determine and obtain the materials used in the original study.
  • Develop a plan that details the type of replication study and research design intended.
  • Outline and implement the study’s best practices.
  • Conduct the replication study, analyze the data, and share the results.

These broad guidelines are expanded in Brown and Wood’s article, “Which tests not witch hunts: a diagnostic approach for conducting replication research.” Their findings are further condensed by Brown into a blog post outlining four main procedural categories:

  • Assumptions : identifying the contextual assumptions of the original study and research team
  • Data transformations : using the study data to answer questions about data transformation choices by the original team
  • Estimation : determining if the most appropriate estimation methods were used in the original study and if the replication can benefit from additional methods
  • Heterogeneous outcomes : establishing whether the data from an original study lends itself to exploring separate heterogeneous outcomes

At the suggestion of peer reviewers from the e-journal Economics, Brown elaborates with a discussion of what not to do when conducting a replication study that includes:

  • Do not use critiques of the original study’s design as  a basis for replication findings.
  • Do not perform robustness testing before completing a direct replication study.
  • Do not omit communicating with the original authors, before, during, and after the replication.
  • Do not label the original findings as errors solely based on different outcomes in the replication.

Again, replication studies are full-blown, legitimate research endeavors that genuinely contribute to scientific knowledge. They require the same levels of planning and dedication as any other study.

What happens when replication fails?

There are some obvious and agreed upon contextual factors that can result in the failure of a replication study such as: 

  • The detection of unknown effects
  • Inconsistencies in the system
  • The inherent nature of complex variables
  • Substandard research practices
  • Pure chance

While these variables affect all research studies, they have particular impact on replication as the outcomes in question are not novel but predetermined.

The constant flux of contexts and variables makes assessing replicability (determining success or failure) very tricky. A publication from the National Academy of Sciences points out that replicability is obtaining consistent, not identical, results across studies aimed at answering the same scientific question. They further provide eight core principles that are applicable to all disciplines.

While there are no straightforward criteria for determining whether a replication is a failure or a success, the National Library of Science and the Open Science Collaboration suggest asking some key questions, such as the following (a brief computational sketch of these checks appears after the list):

  • Does the replication produce a statistically significant effect in the same direction as the original?
  • Is the effect size in the replication similar to the effect size in the original?
  • Does the original effect size fall within the confidence or prediction interval of the replication?
  • Does a meta-analytic combination of results from the original experiment and the replication yield a statistically significant effect?
  • Do the results of the original experiment and the replication appear to be consistent?
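
As a rough illustration of how such questions can be answered numerically, the sketch below takes invented effect estimates and standard errors for an original study and a replication, checks direction and significance, tests whether the original effect lies inside the replication's 95% confidence interval, and forms a simple fixed-effect meta-analytic combination. It is one plausible way to operationalize the checks, not the cited organizations' procedure, and it assumes numpy and scipy are available.

```python
import numpy as np
from scipy import stats

# Invented summary statistics: standardized effects and their standard errors.
orig_effect, orig_se = 0.45, 0.15   # original study
rep_effect, rep_se = 0.30, 0.12     # replication attempt

# 1. Is the replication effect significant and in the same direction?
z_rep = rep_effect / rep_se
p_rep = 2 * stats.norm.sf(abs(z_rep))
print(f"replication: z = {z_rep:.2f}, p = {p_rep:.3f}, "
      f"same direction as original: {np.sign(rep_effect) == np.sign(orig_effect)}")

# 2. Does the original effect size fall inside the replication's 95% CI?
ci = (rep_effect - 1.96 * rep_se, rep_effect + 1.96 * rep_se)
print(f"replication 95% CI: ({ci[0]:.2f}, {ci[1]:.2f}); "
      f"contains original effect: {ci[0] <= orig_effect <= ci[1]}")

# 3. Fixed-effect (inverse-variance) combination of the two estimates.
weights = np.array([1 / orig_se**2, 1 / rep_se**2])
effects = np.array([orig_effect, rep_effect])
combined = np.sum(weights * effects) / np.sum(weights)
combined_se = np.sqrt(1 / np.sum(weights))
z_comb = combined / combined_se
print(f"combined effect: {combined:.2f} (SE {combined_se:.2f}), "
      f"z = {z_comb:.2f}, p = {2 * stats.norm.sf(abs(z_comb)):.4f}")
```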

While many clearly have an opinion about how and why replication fails, declaring a replication a “failure” is at best a null statement and at worst an unfair accusation. It misses the point and sidesteps the role of replication as a mechanism for furthering scientific endeavor by presenting new evidence on an existing question.

Can the replication process be improved?

The need both to restructure the definition of replication to account for variations among scientific fields and to recognize the range of potential outcomes when comparing results with the original data comes in response to the replication crisis. Listen to this Hidden Brain podcast from NPR for an intriguing case study on this phenomenon.

Considered academia’s self-made disaster, the replication crisis is spurring other improvements in the replication process. Most broadly, it has prompted the resurgence and expansion of metascience , a field with roots in both philosophy and science that is widely referred to as "research on research" and "the science of science." By holding a mirror up to the scientific method, metascience is not only elucidating the purpose of replication but also guiding the rigors of its techniques.

Further efforts to improve replication are threaded throughout the industry, from updated research practices and study design to revised publication practices and oversight organizations, such as:

  • Requiring full transparency of the materials and methods used in a study
  • Pushing for statistical reform , including redefining the significance of the p-value
  • Using preregistration reports that present the study’s plan for methods and analysis
  • Adopting result-blind peer review allowing journals to accept a study based on its methodological design and justifications, not its results
  • Founding organizations like the EQUATOR Network that promotes transparent and accurate reporting

Final thoughts

In the realm of scientific research, replication is a form of checks and balances. Neither the probability of a finding nor the prominence of a scientist makes a study immune to the process.

And, while a single replication does not validate or nullify the original study’s outcomes, accumulating evidence from multiple replications does boost the credibility of its claims. At the very least, the findings offer insight to other researchers and enhance the pool of scientific knowledge.

After exploring the philosophy and the mechanisms behind replication, it is clear that the process is not perfect, but evolving. Its value lies within the irreplaceable role it plays in the scientific method. Replication is no more or less important than the other parts, simply necessary to perpetuate the infinite loop of scientific discovery.

Charla Viera, MS


Difference between replication and repeated measurements

The following quote is from Montgomery's Experimental Design:

There is an important distinction between replication and repeated measurements. For example, suppose that a silicon wafer is etched in a single-wafer plasma etching process, and a critical dimension on this wafer is measured three times. These measurements are not replicates; they are a form of repeated measurements, and in this case, the observed variability in the three repeated measurements is a direct reflection of the inherent variability in the measurement system or gauge. As another illustration, suppose that as part of an experiment in semiconductor manufacturing, four wafers are processed simultaneously in an oxidation furnace at a particular gas flow rate and time and then a measurement is taken on the oxide thickness of each wafer. Once again, the measurements on the four wafers are not replicates but repeated measurements. In this case they reflect differences among the wafers and other sources of variability within that particular furnace run. Replication reflects sources of variability both between runs and (potentially) within runs.

I don't quite understand the difference between replication and repeated measurements. Wikipedia says:

The repeated measures design (also known as a within-subjects design) uses the same subjects with every condition of the research, including the control.

According to Wikipedia, the two examples in Montgomery's book aren't repeated measurement experiments.

In the first example, the wafer is used under only one condition, isn't it?

In the second example, each wafer is used under only one condition: "processed simultaneously in an oxidation furnace at a particular gas flow rate and time", isn't it?

"Replication reflects sources of variability both between runs and (potentially) within runs". Then what is for repeated measurements?

  • experiment-design
  • terminology


  • 2 $\begingroup$ Simply put, replication involves same technique on different sample $\endgroup$ –  user36297 Commented Dec 17, 2013 at 3:13

5 Answers

I don't think his second example is replication OR repeated measurements.

Any study involves multiple cases (subjects, people, silicon chips, whatever).

Repeated measures involves measuring the same cases multiple times. So, if you measured the chips, then did something to them, then measured them again, etc it would be repeated measures.

Replication involves running the same study on different subjects but identical conditions. So, if you did the study on n chips, then did it again on another n chips that would be replication.


  • 4 $\begingroup$ How about an almost-mnemonic: you can replicate conditions but not subjects, though you can repeat a measurement on the same subject. $\endgroup$ –  Wayne Commented Mar 5, 2014 at 22:23

Unfortunately, terminology varies quite a bit and in confusing ways, especially between disciplines. There will be many people who will use different terms for the same thing, and/or the same terms for different things (this is a pet peeve of mine). I gather the book in question is this . This is design of experiments from the perspective of engineering (as opposed to the biomedical or the social science perspectives). The Wikipedia entry seems to be coming from the biomedical / social science perspective.

In engineering , an experimental run is typically thought of as having set up your equipment and run it. This produces, in a sense, one data point. Running your experiment again is a replication ; it gets you a second data point. In a biomedical context, you run an experiment and get $N$ data. Someone else replicates your experiment on a new sample with another $N'$ data. These constitute different ways of thinking about what you call an "experimental run". Tragically, they are very confusing.

Montgomery is referring to multiple data from the same run as "repeated measurements". Again, this is common in engineering. A way to think about this from outside the engineering context is to think about a hierarchical analysis, where you are interested in estimating and drawing inferences about the level 2 units . That is, treatments are randomly assigned to doctors and every patient (on whom you take a measurement) is a repeated measurement with respect to the doctor . Within the same doctor, those measurements "reflect differences among the wafers [patients] and other sources of variability within that particular furnace run [doctor's care]".
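
To make the between-run versus within-run distinction concrete, here is a small simulation sketch (not from the original answer; the run counts, variances, and the random-intercept model are illustrative assumptions, and it presumes numpy, pandas, and statsmodels are installed). Repeat measurements are nested within replicate runs, and a random-intercept fit separates the two variance components; without replicating runs, the between-run component could not be estimated at all.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

n_runs, n_repeats = 8, 5            # 8 replicate runs, 5 repeat measurements each
between_sd, within_sd = 2.0, 0.5    # assumed run-to-run and within-run variability

run_offsets = rng.normal(0.0, between_sd, n_runs)
data = pd.DataFrame(
    {
        "run": np.repeat(np.arange(n_runs), n_repeats),
        "y": np.repeat(50.0 + run_offsets, n_repeats)
        + rng.normal(0.0, within_sd, n_runs * n_repeats),
    }
)

# Random intercept for each run: runs capture replicate-to-replicate variation,
# while the residual variance captures repeat-measurement (within-run) variation.
fit = smf.mixedlm("y ~ 1", data, groups=data["run"]).fit()
print("estimated between-run variance:", float(fit.cov_re.iloc[0, 0]))
print("estimated within-run variance: ", fit.scale)
```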

  • $\begingroup$ (+1) Montgomery is referring to multiple data from the same run as "repeated measures" -- the quote actually says "repeated measurements". Is this slight difference in wording important? $\endgroup$ –  amoeba Commented Mar 5, 2014 at 21:24
  • $\begingroup$ Thanks for the catch, @amoeba. I'm used to saying / thinking / typing "repeated measures". It was just a slip of the fingers. $\endgroup$ –  gung - Reinstate Monica Commented Mar 5, 2014 at 21:26
  • $\begingroup$ So just to be clear: Montgomery's "repeated measurements" of wafers are not "repeated measures" of wafers, right? I would say that your answer lacks this stated explicitly. You say that Montgomery's "repeated measurements" can be interpreted as repeated measures with respect to furnaces (fair enough), but furnaces are not the object of study in this quote; wafers are. $\endgroup$ –  amoeba Commented Mar 5, 2014 at 21:30
  • 1 $\begingroup$ @amoeba, off the top of my head, I'm not sure what corresponds to what I would call "repeated measures" in the engineering perspective on DoE. I suppose you could say "Montgomery's 'repeated measurements' can be interpreted as repeated measures with respect to furnaces (fair enough), but furnaces are not the object of study in this quote; wafers are", but M's point is that the repeated measurements are information about "differences among the wafers and other sources of variability within that particular furnace run". Identifying sources of variability is the point of DoE in engineering. $\endgroup$ –  gung - Reinstate Monica Commented Mar 5, 2014 at 22:06
  • 1 $\begingroup$ Imagine you manufacture gears to be used in a machine. The gears must be 3.000 cm in diameter. If they are too small, there will be play in the gears & they will wear out prematurely, shortening the life of the machine. If they are too large, they will cause the machine to seize up & explode, potentially causing other damage or injury. The idea is to identify sources of variability (& subsequently determine how to control them). This is different from biomedical experiments in which the idea is to find viable treatments. $\endgroup$ –  gung - Reinstate Monica Commented Mar 5, 2014 at 22:09

What's going on here is confusion in terminology. In the book, a measurement refers to a single observation within one experimental trial, and the experiment calls for several such observations to be made.

The term 'repeated measures' refers to measuring subjects in multiple conditions.

That is, in a within-subject design (aka crossed design, or repeated measures), you have, say, two conditions: a treatment and a control, and each subject goes through both conditions, usually in a counter-balanced way. This means that you have subjects act as their own control, and this design helps you deal with between-subject variability. One disadvantage of this research design is the problem of carryover effects, where the first condition that the subject goes through adversely influences the other condition.

In other words, don't confuse 'repeated measures' and multiple observations under the same experimental condition.

See also: Are Measurements made on the same patient independent?


  • $\begingroup$ (+1) Do you mean that by "repeated measurements" Montgomery did not mean "repeated measures"? I think it's exactly what you mean, and I agree, but I find that your wording could be a bit more explicit about that. $\endgroup$ –  amoeba Commented Mar 5, 2014 at 21:20

Repetitions versus Replications (Gemba Academy blog, May 8, 2007, by Ron): http://blog.gembaacademy.com/2007/05/08/repetitions-versus-replications/

Many Six Sigma practitioners struggle to differentiate between a repetition and a replication. Normally this confusion arises when dealing with Design of Experiments (DOE).

Let’s use an example to explain the difference.

Sallie wants to run a DOE in her paint booth. After some brainstorming and data analysis she decides to experiment with the “fluid flow” and “attack angle” of the paint gun. Since she has 2 factors and wants to test a “high” and “low” level for each factor she decides on a 2 factor, 2 level full factorial DOE. Here is what this basic design would look like.

Now then, Sallie decides to paint 6 parts during each run. Since there are 4 runs she needs at least 24 parts (6 x 4). These 6 parts per run are what we call repetitions. Here is what the design looks like with the 6 repetitions added to the design.

Finally, since this painting process is ultra critical to her company Sallie decides to do the entire experiment twice. This helps her add some statistical power and serves as a sort of confirmation. If she wanted to she could do the first 4 runs with the day shift staff and the second 4 runs with the night shift staff.

Completing the DOE a second time is what we call replication. You may also hear the term blocking used instead of replicating. Here is what the design looks like with the 6 repetitions and replication in place (in yellow).

So there you have it! That is the difference between repetition and replication.
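
The tables from the original post are not shown here, but the structure is easy to reconstruct. The pandas sketch below is one plausible rendering (factor names and coded levels follow the example): a 2-factor, 2-level full factorial, six repeated parts within each run, and the whole experiment performed twice as a replicate.

```python
import itertools
import pandas as pd

# 2-factor, 2-level full factorial: fluid flow and attack angle at low/high (-1/+1).
factors = {"fluid_flow": [-1, 1], "attack_angle": [-1, 1]}
runs = pd.DataFrame(
    list(itertools.product(*factors.values())), columns=list(factors.keys())
)

# Repetitions: six parts painted within each of the 4 runs (24 parts per pass).
parts = runs.loc[runs.index.repeat(6)].reset_index(drop=True)
parts["part"] = parts.groupby(["fluid_flow", "attack_angle"]).cumcount() + 1

# Replication: the whole 4-run experiment is performed a second time
# (for example, day shift vs. night shift), giving 48 parts in all.
design = pd.concat(
    [parts.assign(replicate=1), parts.assign(replicate=2)], ignore_index=True
)
print(design.head(8))
print("total parts:", len(design))   # 2 replicates x 4 runs x 6 repetitions = 48
```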


  • $\begingroup$ Not able to see the figures you are referring to. $\endgroup$ –  user3024069 Commented Mar 11, 2021 at 7:17

Let me add an interesting factor, lot. In the above example, instead of making six tests with the same lot of paint (which, per the above definitions, means six repetitions per combination of conditions), she tests with six different paint lots per combination of conditions, which also means 24 total experiments; does this mean she is doing six replications per combination of conditions?

Another example: a liquid pigment is measured for color intensity I. The lab method of analysis has two factors: suspension clarification time T and sample size W. Each factor has two levels, i.e., short and long T, and small and large W. That makes a 2x2 design. Testing the same lot sample under the four different conditions means there are 4 experiments in total, no repetitions. Testing the same lot twice each time means there would be two repetitions per condition, 8 experiments in total. But what if we test samples from six different lots per condition? Does this mean there are six replications per combination of conditions? The number of experiments would be 24.

Now, we may want to make the method more precise and ask the lab technician to repeat the test twice (from the same sample) every time he makes a measurement, and report only the average per lot sample. I assume we could use the averages as a single result per lot sample, and for DoE, say a 2-way layout ANOVA with replications, each lot sample result is a replication. Please comment.

Improving Experimental Precision with Replication: A Comprehensive Guide

Updated: June 21, 2023 by Ken Feldman


Replication is the non-consecutive running of the experimental design multiple times. The purpose is to provide additional information and degrees of freedom to better understand and estimate the variation in the experiment. It is not the same as repetition. Let's learn a little bit more about this.

Overview: What is replication?

Three important concepts in Design of Experiments (DOE) are randomization, repetition, and replication. In DOE you identify potential factors which, if set at different levels, will impact some desired response variable. This allows you to predict the outcome of your response variable based on the optimal settings for your factors and levels.

Depending on the number of factors and levels, your DOE will be run as combinations of the factors and levels. The order of those runs should be randomized to protect the experiment against unwanted systematic noise. Doing multiple runs of the same combinations will provide more data and a better estimate of the variation.

If you do consecutive runs of a specific combination, that is repetition, which adds no additional understanding of run-to-run variation. But if you run the same combinations of factors and levels non-sequentially, you are doing replication. Repetition is the analogue of repeatability in Measurement System Analysis (MSA), while replication is the analogue of reproducibility.

In summary, repetition and replication are both multiple response measurements taken at the same combination of factor settings. Repeat measurements are taken during the same experimental run or consecutive runs. Replicate measurements are taken during identical but different experimental runs.

An industry example of replication

Here is what a non-replicated and a replicated design matrix look like for a full factorial, three-factor, two-level randomized experiment using 3 replicates, which a Six Sigma Black Belt wanted to run.

[Figure: replication chart (not reproduced)]

Note that, in most cases, consecutive runs in the replicated design below are not identical.

[Figure: replication chart (not reproduced)]
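
Since the charts themselves are not reproduced, the following sketch generates the same kind of matrix: a three-factor, two-level full factorial replicated three times, with the run order randomized. The factor names and the random seed are arbitrary placeholders, and pandas is assumed to be available.

```python
import itertools
import pandas as pd

# Full factorial: 3 factors at 2 coded levels (-1 / +1) -> 8 combinations.
combos = pd.DataFrame(
    list(itertools.product([-1, 1], repeat=3)), columns=["A", "B", "C"]
)

# 3 replicates of the entire design -> 24 runs in total.
design = pd.concat(
    [combos.assign(replicate=r) for r in (1, 2, 3)], ignore_index=True
)

# Randomize the run order, then record it explicitly.
design = design.sample(frac=1, random_state=7).reset_index(drop=True)
design.insert(0, "run_order", design.index + 1)
print(design.head(10))

# After randomization, replicate runs of the same A/B/C combination are scattered
# through the 24 runs, which is why consecutive runs rarely share settings.
```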

Frequently Asked Questions (FAQ) about replication

What is the difference between replication and repetition?

Both are repeated runs of your combinations of factors and levels. Repetition performs the duplicate runs consecutively, while replication performs them as identical but separate, non-consecutive experimental runs.

What is the purpose of doing experimental replication?

The addition of replicated runs will provide more information about the variability of the process and will reflect the variation of the setup between runs.

Does replication affect the power of an experiment?

Yes. The more replicated runs you have, the more data  you will gather during your experiment. The increased data and understanding of your variation will allow you to increase your power and improve your precision and ability to spot the effect of your factors and levels on your response variable.
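
As a rough numerical illustration of that point, the sketch below uses statsmodels' power calculations (assumed to be available) for a two-sample comparison of a factor's low and high settings. The standardized effect size and significance level are placeholder assumptions; the pattern to notice is how power climbs as the number of replicate runs per setting grows.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.8   # assumed standardized effect of moving one factor low -> high
alpha = 0.05

# Power to detect that effect as replicate runs per factor setting increase.
for n_per_setting in (2, 3, 5, 8, 12, 20):
    power = analysis.power(
        effect_size=effect_size, nobs1=n_per_setting, alpha=alpha, ratio=1.0
    )
    print(f"{n_per_setting:2d} replicates per setting -> power = {power:.2f}")
```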

About the Author


Ken Feldman

How Many Times Should an Experiment be Replicated?


Natalia Juristo & Ana M. Moreno, Universidad Politecnica de Madrid, Spain


An important decision in any problem of experimental design is to determine how many times an experiment should be replicated. Note that we are referring to the internal replication of an experiment. Generally, the more it is replicated, the more accurate the results of the experiment will be. However, resources tend to be limited, which places constraints on the number of replications. In this chapter, we will consider several methods for determining the best number of replications for a given experiment. We will focus on one-factor designs, but the general-purpose methodology can be extended to more complex experimental situations.
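
One common way to answer the chapter's question for a one-factor design, sketched below under assumed values rather than the chapter's own worked method, is a power calculation: pick an effect size worth detecting, a significance level, and a target power, and solve for the number of observations (and hence replicates per level). The group count and effect size here are illustrative, and statsmodels is assumed to be installed.

```python
import math
from statsmodels.stats.power import FTestAnovaPower

k_groups = 4        # assumed: one factor with four levels
effect_size = 0.4   # assumed Cohen's f for the factor's effect
alpha, target_power = 0.05, 0.8

# Solve for the total number of observations needed in a one-way ANOVA.
total_n = FTestAnovaPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=target_power, k_groups=k_groups
)
replicates_per_level = math.ceil(total_n / k_groups)
print(f"total observations needed: {total_n:.1f} "
      f"(about {replicates_per_level} replicates per level)")
```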



About this chapter

Juristo, N., Moreno, A.M. (2001). How Many Times Should an Experiment Be Replicated? In: Basics of Software Engineering Experimentation. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3304-4_15



Source: Infection and Immunity 78(12), December 2010

Reproducible Science

Arturo Casadevall, Editor in Chief, mBio; Departments of Microbiology & Immunology and Medicine, Albert Einstein College of Medicine, Bronx, New York

Ferric C. Fang, Editor in Chief, Infection and Immunity; Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington

The reproducibility of an experimental result is a fundamental assumption in science. Yet, results that are merely confirmatory of previous findings are given low priority and can be difficult to publish. Furthermore, the complex and chaotic nature of biological systems imposes limitations on the replicability of scientific experiments. This essay explores the importance and limits of reproducibility in scientific manuscripts.

“Non-reproducible single occurrences are of no significance to science.” —Karl Popper ( 18 )

There may be no more important issue for authors and reviewers than the question of reproducibility, a bedrock principle in the conduct and validation of experimental science. Consequently, readers, reviewers, and editors of Infection and Immunity can rightfully expect to see information regarding the reproducibility of experiments in the pages of this journal. Articles may describe findings with a statement that an experiment was repeated a specific number of times, with similar results. Alternatively, depending upon the nature of the experiment, the results from multiple experimental replicates might be presented individually or in combined fashion, along with an indication of experiment-to-experiment variability. For most types of experiment, there is an unstated requirement that the work be reproducible, at least once, in an independent experiment, with a strong preference for reproducibility in at least three experiments. The assumption that experimental findings are reproducible is a key criterion for acceptance of a manuscript, and the Instructions to Authors insist that “the Materials and Methods section should include sufficient technical information to allow the experiments to be repeated.”

In prior essays, we have explored the adjectives descriptive ( 6 ), mechanistic ( 7 ), and important ( 8 ) as they apply to biology, and experimental science, in particular. In this essay, we explore the problem of reproducibility in science, with emphasis on the type of science that is routinely reported in Infection and Immunity. In exploring the topic of reproducibility, it is useful to first consider terminology. “Reproducibility” is defined by the Oxford English Dictionary as “the extent to which consistent results are obtained when produced repeatedly.” Although it is taken for granted that scientific experiments should be reproducible, it is worth remembering that irreproducible one-time events can still be a tremendously important source of scientific information. This is particularly true for observational sciences in which inferences are made from events and processes not under an observer's control. For example, the collision of comet Shoemaker-Levy with Jupiter in July 1994 provided a bonanza of information on Jovian atmospheric dynamics and prima facie evidence for the threat of meteorite and comet impacts. Consequently, the criterion of reproducibility is not an essential requirement for the value of scientific information, at least in some fields. Scientists studying the evolution of life on earth must contend with their inability to repeat that magnificent experiment. Gould famously observed that if one were to “rewind the tape of life,” the results would undoubtedly be different, with the likely outcome that nothing resembling ourselves would exist ( 12 ). (Note for younger scientists: it used to be fashionable to record sounds and images on metal oxide-coated tape and play them back on devices called “tape players.”) This is supported by the importance of stochastic and contingent events in experimental evolutionary systems ( 4 ).

Given the requirement for reproducibility in experimental science, we face two apparent contradictions. First, published science is expected to be reproducible, yet most scientists are not interested in replicating published experiments or reading about them. Many reputable journals, including Infection and Immunity , are unlikely to accept manuscripts that precisely replicate published findings, despite the explicit requirement that experimental protocols must be reported in sufficient detail to allow repetition. This leads to a second paradox that published science is assumed to be reproducible, yet only rarely is the reproducibility of such work tested or known. In fact, the emphasis on reproducing experimental results becomes important only when work becomes controversial or called into doubt. Replication can even be hazardous. The German scientist Georg Wilhelm Reichmann was fatally electrocuted during an attempt to reproduce Ben Franklin's famous experiment with lightning ( 1 ). The assumption that science must be reproducible is implicit yet seldom tested, and in many systems the true reproducibility of experimental data is unknown or has not been rigorously investigated in a systematic fashion. Hence, the solidity of this bedrock assumption of experimental science lies largely in the realm of belief and trust in the integrity of the authors.

Reproducibility versus replicability.

Although many biological scientists intuitively believe that the reproducibility of an experiment means that it can be replicated, Drummond makes a distinction between these two terms ( 9 ). Drummond argues that reproducibility requires changes, whereas replicability avoids them ( 9 ). In other words, reproducibility refers to a phenomenon that can be predicted to recur even when experimental conditions may vary to some degree. On the other hand, replicability describes the ability to obtain an identical result when an experiment is performed under precisely identical conditions. For biological scientists, this would appear to be an important distinction with everyday implications. For example, consider a lab attempting to reproduce another lab's finding that a certain bacterial gene confers a certain phenotype. Such an experiment might involve making gene-deficient variants, observing the effects of gene deletion on the phenotype, and, if phenotypic changes are apparent, then going further to show that gene complementation restores the original phenotype. Given a high likelihood of microevolution in microbial strains and the possibility that independently synthesized gene disruption and replacement cassettes may have subtly different effects, then the attempt to reproduce findings does not necessarily involve a precise replication of the original experiment. Nevertheless, if the results from both laboratories are concordant, then the experiment is considered to be successfully reproduced, despite the fact that, according to Drummond's distinction, it was never replicated. On the other hand, if the results differ, a myriad of possible explanations must be considered, some of which relate to differences in experimental protocols. Hence, it would seem that scientists are generally interested in the reproducibility of results rather than the precise replication of experimental results. Some variation of conditions is considered desirable because obtaining the same result without absolutely faithful replication of the experimental conditions implies a certain robustness of the original finding. In this example, the replicatibility of the original experiment following the exact protocols initially reported would be important only if all subsequent attempts to reproduce the result were unsuccessful. When findings are so dependent on precise experimental conditions that replicatibility is needed for reproducibility, the result may be idiosyncratic and less important than a phenomenon that can be reproduced by a variety of independent, nonidentical approaches.

Replicability requirement for individual studies.

Given the difference between reproducibility and replicability that depends on whether experimental conditions are subject to variation, it is apparent that when most papers state that data are reproducible, they actually mean that the experiment has been replicated. On the other hand, when different laboratories report the confirmation of a phenomenon, it is likely that this reflects reproducibility, since experimental variability between labs is likely to result in some variable(s) being changed. In fact, depending on the number of variables involved, replicability may be achievable only in the original laboratory and possibly by the same experimenter. This accounts for the greater confidence one has in a scientific observation that has been corroborated by independent observers.

The desirability of replicability in experimental science leads to the practical question of how many times an experiment should be replicated before publication. Most reviewers would demand at least one replication, while preferring more. In this situation, the replicability of an experiment provides assurance that the effect is not due to chance alone or an experimental artifact resulting in a one-time event. Ideally, an experiment should be repeated multiple times before it is reported, with the caveat that for some experiments the expense of this approach may be prohibitive. Guidelines for experimentation with vertebrate animals also discourage the use of unnecessary duplication ( 10 , 17 ). In fact, some institutions may explicitly prohibit the practice of repeating animal experiments that reproduce published results. We agree with the need to repeat experiments but suggest that authors strive for reproducibility instead of simple replicability. For example, consider an experiment in which a particular variable, the level of a specific antibody, is believed to account for a specific experimental outcome, resistance to a microbial pathogen. Passive administration of the immunoglobulin can be used to provide protection and support the hypothesis. Rather than simply replicating this experiment, the investigator might more fruitfully conduct a dose-response experiment to determine the effect of various antibody doses or microbial inocula and test multiple strains rather than simply carrying out multiple replicates of the original experiment.

Limits of replicability and reproducibility.

Although the ability of an investigator to confirm an experimental result is essential to good science, with an inherent assumption of reproducibility, we note that there are practical and philosophical limits to the replicability and reproducibility of findings. Although to our knowledge this question has not been formally studied, replicability is likely to be inversely proportional to the number of variables in an experiment. This is all too apparent in clinical studies, leading Ioannidis to conclude that most published research findings are false ( 13 ). Statistical analysis and meta-analysis would not be required if biological experiments were precisely replicatable. Initial results from genetic association studies are frequently unconfirmed by follow-up analyses ( 14 ), clinical trials based on promising preclinical studies frequently fail ( 16 ), and a recent paper reported that only a minority of published microarray results could be repeated ( 15 ). Such observations have even led some to question the validity of the requirement for replication in science ( 21 ).

Every variable contains a certain degree of error. Since error propagates linearly or nonlinearly depending on the system, one may conclude that the more variables involved, the more errors can be expected, thus reducing the replicability of an experiment. Scientists may attempt to control variables in order to achieve greater reproducibility but must remember that as they do so, they may progressively depart from the heterogeneity of real life. In our hypothetical experiment relating specific antibody to host resistance, errors in antibody concentration, inoculum, and consistency of delivery can conspire to produce different outcomes with each replication attempt. Although these errors may be minimized by good experimental technique, they cannot be eliminated entirely. There are other sources of variation in the experiment that are more difficult to control. For example, mouse groups may differ, despite being matched by genetics, supplier, gender, and age, in such intangible areas as nutrition, stress, circadian rhythm, etc. Similarly, it is very difficult to prepare infectious inocula on different days that closely mirror one another given all the variables that contribute to microbial growth and virulence. To further complicate matters, the outcomes of complex processes such as infection and the host response do not often manifest simple dose-response relationships. Inherent stochasticity in biological processes ( 19 ) and anatomic or functional bottlenecks ( 2 ) provide additional sources of experiment-to-experiment variability. For many experiments reported in Infection and Immunity , the outcome of the experiment is highly dependent on initial experimental conditions, and small variations in the initial variables can lead to chaotic results. In such systems where exact replicability is difficult or impossible to achieve, the goal should be general reproducibility of the overall results. Ironically, results that are replicated too precisely are “too good to be true” and raise suspicions of data falsification ( 3 ), illustrating the tacit recognition that biological results inherently exhibit a degree of variation.

To continue the example given above, the conclusion that antibody was protective may be reproduced in subsequent experiments despite the fact that the precise initial result on average survival was never replicated, in the sense that subsequent experiments varied in magnitude of difference observed and time to death for the various groups. Investigators may be able to increase the likelihood that individual experiments are reproducible by enhancing their robustness. A well-known strategy to enhance the likelihood of reproducibility is to increase the power of the experiment by increasing the number of individual measurements, in order to minimize the contribution of errors or random effects. For example, using 10 mice per group in the aforementioned experiment is more likely to lead to reproducible results than using 3 mice, other things being equal. Along the same lines, two experiments using 10 mice each will provide more confidence in the robustness of the results than will a single experiment involving 20 animals, because obtaining similar results on different days lessens the likelihood that a given result was strongly influenced by an unrecognized variable on the particular day of the experiment. When reviewers criticize low power in experimental design, they are essentially worried that the effect of variable uncertainty on low numbers of measurements will adversely influence the reproducibility of the findings. However, subjective judgments based on conflicting values can influence the determination of sample size. For instance, investigators and reviewers are more likely to accept smaller sample sizes in experiments using primates. Consequently, a sample size of 3 might be acceptable in an experiment using chimpanzees while the same sample size might be regarded as unacceptable in a mouse experiment, even if the results in both cases achieve statistical significance. Similarly, cost can be a mitigating factor in determining the minimum number of replicates. For nucleic acid chip hybridization experiments, measurements in triplicate are recommended despite the complexity of such experiments and the range of variation inherent in such measurement, a recommendation that tacitly accepts the prohibitive cost of larger numbers of replicates for most investigators ( 5 ). Cost is also a major consideration in replicating transgenic or knockout mouse experiments in which mouse construction may take years. Hence, the power of an experiment can be estimated accurately using statistics, but real-life considerations ranging from the ethics of animal experimentation to monetary expense can influence investigator and reviewer judgment.

We cannot leave the subject of scientific reproducibility without acknowledging that questions about replicability and reproducibility have long been at the heart of philosophical debates about the nature of science and the line of demarcation between science and non-science. While scientists and reviewers demand evidence for the reproducibility of scientific findings, philosophers of science have largely discarded the view that scientific knowledge should meet the criterion that it is verifiable. Through inductive reasoning, Bacon used data to infer that under similar circumstances a result will be repeated and can be used to make generalizations about other related situations ( 11 ). However, the logical consistency of such views was challenged by Hume, who posited that inferences from experiences (or, in our case, experiments) cannot be assumed to hold in the future because the future may not necessarily be like the past. In other words, even the daily rising of the sun for millennia does not provide absolute assurance that it will rise the next day. The philosophies of logical positivism and verificationism viewed truth as reflecting the reproducibility of empirical experience, dependent on propositions that could be proven to be true or false. This was challenged by Popper, who suggested that a hypothesis could not be proven, only falsified or not, leaving open the possibility of a rare predictable exception, vividly depicted as the metaphor of a “black swan” ( 20 ). One million sightings of white swans cannot prove the hypothesis that all swans are white, but the hypothesis can be falsified by the sight of a single black swan.

A pragmatic approach to reproducibility.

Given the challenges of achieving and defining replicatibility and reproducibility in experimental science, what practical guidance can we provide? Despite valid concerns ranging from the true reproducibility of experimental science to the logical inconsistencies identified by philosophers of science, experimental reproducibility remains a standard and accepted criterion for publication. Hence, investigators must strive to obtain information with regard to the reproducibility of their results. That, in turn, raises the question of the number of replications needed for acceptance by the scientific community. The number of times that an experiment is performed should be clearly stated in a manuscript. A new finding should be reproduced at least once and preferably more times. However, even here there is some room for judgment under exceptional circumstances. Consider a trial of a new therapeutic molecule that is expected to produce a certain result in a primate experiment based on known cellular processes. If one were to obtain precisely the predicted result, one might present a compelling argument for accepting the results of the single experiment on moral grounds regarding animal experimentation, especially in situations in which the experiment results in injury or death to the animal. At the other extreme, when an experiment is easily and inexpensively carried out without ethical considerations, then it behooves the investigator to ascertain the replicability and reproducibility of a result as fully as possible. However, there are no hard and fast rules for the number of times that an experiment should be replicated before a manuscript is considered acceptable for publication. In general, the importance of reproducibility increases in proportion to the importance of a result, and experiments that challenge existing beliefs and assumptions will be subjected to greater scrutiny than those fitting within established paradigms.

Given that most experimental results reported in the literature will not be subjected to the test of precise replication unless the results are challenged, it is essential for investigators to make their utmost efforts to place only the most robust data into print, and this almost always involves a careful assessment of the variability inherent in a particular experimental protocol and the provision of information regarding the replicability of the results. In this instance, more is better than less. To ensure that research findings are robust, it is particularly desirable to demonstrate their reproducibility in the face of variations in experimental conditions. Reproducibility remains central to science, even as we recognize the limits of our ability to achieve absolute predictability in the natural world. Then again, ask us next week and you might get a different answer.

The views expressed in this Editorial do not necessarily reflect the views of the journal or of ASM.

Editor: A. Camilli

Published ahead of print on 27 September 2010.

How many times should an experiment be repeated?

I am doing an experiment as part of a school project. In order to decrease the random error I repeat the measurements.

How do I decide whether I have made enough tries? Should it be 10? Or 20? Mathematically speaking, the more tries I make, the better the precision; however, by that reasoning I would need to repeat the measurements an infinite number of times.

  • experimental-physics
  • error-analysis


  • $\begingroup$ The ideal is 3 times or more $\endgroup$ –  QuIcKmAtHs Commented Dec 29, 2017 at 15:55
  • $\begingroup$ Trials and Experiments $\endgroup$ –  user179430 Commented Dec 29, 2017 at 16:06
  • 2 $\begingroup$ The answer depends on what you're measuring. Giving some details about your experiment would help. $\endgroup$ –  lemon Commented Dec 29, 2017 at 17:27
  • 1 $\begingroup$ the answer also depends on a characteristic of the measuring device, called "gauge capability". this has to do with how accurately and repeatably your measuring device does its job. Knowing the capability of your gauge allows you to determine whether differences in your measurements are dominated by flaws in the gauge rather than real differences between your experimental measurements. My own rule-of-thumb is 5 measurements are suggestive, 10 are data, and 50 are information- but this assumes a "capable" gauge. $\endgroup$ –  niels nielsen Commented Dec 29, 2017 at 20:39

2 Answers

The answer depends on the degree of accuracy needed, and how noisy the measurements are. The requirements are set by the task (and your resources, such as time and effort), the noisiness depends on the measurement method (and perhaps on the measured thing, if it behaves a bit randomly).

For normally distributed errors (commonly but not always true), if you do $N$ independent measurements $x_i$ where each measurement error is normally distributed around the true mean $\mu$ with a standard error $\sigma$: you get an estimated mean by averaging your measurements $\hat{\mu}=(1/N)\sum_i x_i$. The neat thing is that the error in the estimate declines as you make more measurements, as $$\sigma_{mean}=\frac{\sigma}{\sqrt{N}}.$$ So if you knew that the standard error $\sigma$ was (say) 1 and you wanted a measurement that had a standard error 0.1, you can see that having $N=100$ would bring you down to that level of precision. Or, if $\delta$ is the desired accuracy, you need to make $\approx (\sigma/\delta)^2$ tries.

But when starting you do not know $\sigma$. You can get an estimate of the standard error of your measurements $\hat{\sigma}=\sqrt{\frac{1}{N-1}\sum_i (x_i-\hat{\mu})^2}$. This is a noisy result, since it is all based on your noisy measurements - if everything has gone right it is somewhere in the vicinity of the true $\sigma$, and you can use further statistical formulas to bound how much in error you might be in the error of your estimate. There are lots of annoying/interesting/subtle issues here that fill statistics courses.
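
For a concrete feel for these formulas, here is a small numerical sketch (the pilot measurements are invented, and roughly normal errors are assumed): estimate the mean, the spread $\sigma$ of a single measurement, and the standard error of the mean, then apply $N \approx (\sigma/\delta)^2$ for a desired accuracy $\delta$.

```python
import numpy as np

# Invented pilot measurements of the same quantity.
x = np.array([9.8, 10.1, 9.9, 10.3, 9.7, 10.0, 10.2, 9.9, 10.1, 9.8])

n = len(x)
mean_hat = x.mean()
sigma_hat = x.std(ddof=1)            # estimated spread of a single measurement
se_mean = sigma_hat / np.sqrt(n)     # standard error of the estimated mean
print(f"mean = {mean_hat:.3f}, sigma_hat = {sigma_hat:.3f}, se(mean) = {se_mean:.4f}")

# Rough number of tries needed for a desired accuracy delta: N ~ (sigma/delta)^2.
delta = 0.05
n_needed = int(np.ceil((sigma_hat / delta) ** 2))
print(f"for accuracy {delta}, roughly {n_needed} measurements are needed")
```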

In practice, for a school project : define how you make your measurements beforehand, make 10 or more, calculate the mean and standard error, and look at the data you have (this last step is often missed even by professional data scientists!) If the data is roughly normally distributed most measurements should be bunched up with a few outliers that are larger and smaller, and about half should be below the mean and half above. If you want to be cautious, check that the median (the middlemost data point) is close to the mean.

If the data is pretty normal, estimate how many tries you need and do them.

If the data does not look normal - very remote outliers, clumps away from the mean, skew (more high or low data points) - then the above statistics is suspect. Calculating means and standard errors still make sense and can/should be reported, but the formula for the accuracy will not be accurate. In cases like this it is often best to make a lot of measurements and in the report show the distribution of results to get a sense of the accuracy.

Things to look out for that this will not fix: biased measurements (whether that is due to always rounding up, always measuring from one side with a ruler, a thermometer that shows values slightly too high), too crude measurements, calculation errors (embarrassingly common even in published science), errors in the experimental setup (are you really measuring what you want to measure?) and model errors (are you thinking about the problem in the right way?). No amount of statistics will fix this, but some planning and experimentation may help reduce the risk. Biased measurements can be corrected by checking that you get the right results for known cases and/or calibrating the device. Having two or more ways of measuring or calculating is a great sanity check. Experimental setup and model errors can be corrected by listening to annoying critics (who you can then magnanimously thank in your acknowledgement section).


Pick a number, let's say ten. Record your measurements. Determine the mean, the standard deviation, and the standard error. The mean ± 2 × standard error gives you an approximate 95% confidence interval for the mean.

A chi-squared goodness-of-fit test will indicate whether your data distribution is acceptable.

If the standard error is too high, do more trials to reduce it. If the chi-squared test fails, your data are likely skewed, which usually means there is some error in your measurement process. Correct that and try again.
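
A compact sketch of that recipe with invented data follows; scipy's Shapiro-Wilk test stands in here for the chi-squared check of the distribution, which is a substitution rather than the answer's exact prescription, and numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy import stats

# Invented measurements from ten trials.
x = np.array([4.9, 5.2, 5.0, 5.1, 4.8, 5.3, 5.0, 4.9, 5.1, 5.2])

mean = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))
print(f"mean = {mean:.3f}, rough 95% interval = "
      f"({mean - 2 * se:.3f}, {mean + 2 * se:.3f})")

# Check whether the distribution of the data looks acceptably normal.
stat, p = stats.shapiro(x)
print(f"Shapiro-Wilk: statistic = {stat:.3f}, p = {p:.3f}")
if p < 0.05:
    print("data look non-normal/skewed; re-examine the measurement process")
else:
    print("no evidence against normality; more trials will mainly shrink the SE")
```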


IMAGES

  1. Replication and Repetition Summative Practice Diagram

    repeats in an experiment

  2. Scientific Method by Kate Vernier

    repeats in an experiment

  3. Repeat the Experiment!

    repeats in an experiment

  4. Example of multiple repeats of the experiment in a single subject. (A

    repeats in an experiment

  5. Solved A scientist repeats the Millikan oil drop experiment

    repeats in an experiment

  6. The structural features of CRISPR. The repeat sequences with constant

    repeats in an experiment

Related sources and excerpts

  1. Replicates and repeats—what is the difference and is it significant?

    The answer, of course, is 'no'. Replicates serve as internal quality checks on how the experiment was performed. If, for example, in the experiment described in Table 1 and Fig 1, one of the replicate plates with saline-treated WT bone marrow contained 100 colonies, you would immediately suspect that something was wrong. You could check the ...

  2. Replicates and repeats in designed experiments

    Quality engineers design two experiments, one with repeats and one with replicates, to evaluate the effect of the settings on quality. The first experiment uses repeats. The operators set the factors at predetermined levels, run production, and measure the quality of five products. They reset the equipment to new levels, run production, and ...

  3. Repetition vs Replication: Key Differences

    Replication assesses whether the same experiment yields consistent results across different trials or conditions, ensuring external validity and generalizability. Repetition focuses on obtaining multiple measurements within the same experiment or closely related experiments to assess precision and internal consistency. In essence, replication examines consistency ...

  4. What are Replicates? A Complete Guide

    Multiple repeats are generally averaged and may also include the standard deviation; repeats do not add additional degrees of freedom (see the sketch after this list). The number of replicates needed in an experiment will depend on the level of precision required, the expected variability of the response, and the resources available.

  5. Replication (statistics)

    In engineering, science, and statistics, replication is the process of repeating a study or experiment under the same or similar conditions to support the original claim, which is crucial for confirming the accuracy of results as well as for identifying and correcting flaws in the original experiment. [1]

  6. Replication

    For example, if the cost ratio of animals to cells to measurements is 10:1:0.1 (biological replicates are likely more expensive than technical ones), then an experiment with n_A, n_C, n_M of 12, 12, 1 ...

  7. What is Science?: Repeat and Replicate

    Replication. Once we have repeated our testing over and over, and think we understand the results, then it is time for replication. That means getting other scientists to perform the same tests, to see whether they get the same results. As with repetition, the most important things to watch for are results that don't fit our hypothesis, and for ...

  8. (PDF) Replicates and repeats—what is the difference and is it

    Each test was repeated three times. This replication of the data was to help serve as an internal quality check on how the experiment was performed [12, 13]. The mean of the triplicate tests was ...

  9. Types of Replicates: Technical vs. Biological

    Biological replicates derived from independent samples help capture random biological variation. For example, three biological replicates (A, B, and C) are collected from three independent mice. Each of these biological replicates was run in three technical replicates (A1, A2, A3; B1, B2, B3; C1, C2, C3) in one Western blot assay.

  10. 5 Replicability

    We repeat our definition of replicability, with emphasis added: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. Determining consistency between two different results or inferences can be approached in a number of ways (Simonsohn, 2015; Verhagen and ...

  11. Replicating scientific results is tough

    Replicability, the ability to obtain the same result when an experiment is repeated, is foundational to science. But in many research fields it has proved difficult to achieve. An important ...

  12. Repeat is an important tool in science

    In science, repetition allows researchers to confirm or refute their hypotheses. It is a fundamental principle of the scientific method, requiring repeatable and reproducible experiments. By repeating an experiment, scientists can test the validity of their results and ensure that they are not a fluke or a random occurrence. Repetition is also ...

  13. Experimental Design: Types, Examples & Methods

    Three types of experimental designs are commonly used: 1. Independent Measures. Independent measures design, also known as between-groups, is an experimental design where different participants are used in each condition of the independent variable. This means that each condition of the experiment includes a different group of participants.

  14. Why is Replication in Research Important?

    Replication in research is important because it allows for the verification and validation of study findings, building confidence in their reliability and generalizability. It also fosters scientific progress by promoting the discovery of new evidence, expanding understanding, and challenging existing theories or claims. Updated on June 30, 2023.

  15. 11: Introduction to Repeated Measures

    Repeated measures in time are the type in which experimental units receive a treatment and are then simply followed, with repeated measures taken on the response variable at several time points. In contrast, experiments can involve administering all treatment levels (in a sequence) to each experimental unit. This type of repeated measures study is called a ...

  16. experiment design

    The following quote is from Montgomery's Experimental Design: There is an important distinction between replication and repeated measurements. For example, suppose that a silicon wafer is etched in a single-wafer plasma etching process, and a critical dimension on this wafer is measured three times. These measurements are not replicates ...

  17. Improving Experimental Precision with Replication: A ...

    Replication is the non-consecutive running of the experimental design multiple times. The purpose is to provide additional information and degrees of freedom to better understand and estimate the variation in the experiment.

  18. Replicates and repeats—what is the difference and is it significant?:

    The answer, of course, is 'no'. Replicates serve as internal quality checks on how the experiment was performed. If, for example, in the experiment described in Table 1 and Fig 1, one of the replicate plates with saline‐treated WT bone marrow contained 100 colonies, you would immediately suspect that something was wrong. You could check ...

  19. Increasing the Ability of an Experiment to Measure an Effect

    Repeating an experiment more than once helps determine if the data was a fluke, or represents the normal case. It helps guard against jumping to conclusions without enough evidence. The number of repeats depends on many factors, including the spread of the data and the availability of resources.

  20. How Many Times Should an Experiment be Replicated?

    Abstract. An important decision in any problem of experimental design is to determine how many times an experiment should be replicated. Note that we are referring to the internal replication of an experiment. Generally, the more it is replicated, the more accurate the results of the experiment will be. However, resources tend to be limited ...

  21. Reproducible Science

    We agree with the need to repeat experiments but suggest that authors strive for reproducibility instead of simple replicability. For example, consider an experiment in which a particular variable, the level of a specific antibody, is believed to account for a specific experimental outcome, resistance to a microbial pathogen.

  22. How many times should an experiment be repeated?

    The ideal is 3 times or more. – QuIcKmAtHs. The answer depends on what you're measuring; giving some details about your experiment would help.
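
The sketch referenced in entry 4 above: repeat measurements taken within a run are averaged into a single response value per run (they add no degrees of freedom), while replicate runs enter the analysis as separate rows. The numbers below are made up for illustration.

import numpy as np

# Three replicate runs at the same factor settings, five repeat measurements per run.
repeats_per_run = np.array([
    [10.1, 10.0, 10.2,  9.9, 10.1],   # run 1
    [10.4, 10.5, 10.3, 10.4, 10.6],   # run 2
    [ 9.8,  9.9,  9.7, 10.0,  9.8],   # run 3
])

# Repeats are summarised per run; the three run means are the replicate responses.
run_means = repeats_per_run.mean(axis=1)
print("per-run means (replicate responses):", run_means.round(2))

# Replicates capture run-to-run variability; repeats only capture within-run variability.
print("between-run std:", run_means.std(ddof=1).round(3))
print("average within-run std:", repeats_per_run.std(axis=1, ddof=1).mean().round(3))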