
Reproducibility of Scientific Results

The terms “reproducibility crisis” and “replication crisis” gained currency in conversation and in print over the last decade (e.g., Pashler & Wagenmakers 2012), as disappointing results emerged from large scale reproducibility projects in various medical, life and behavioural sciences (e.g., Open Science Collaboration, OSC 2015). In 2016, a poll conducted by the journal Nature reported that more than half (52%) of scientists surveyed believed science was facing a “replication crisis” (Baker 2016). More recently, some authors have moved to more positive terms for describing this episode in science; for example, Vazire (2018) refers instead to a “credibility revolution” highlighting the improved methods and open science practices it has motivated.

The term “crisis” here refers collectively to at least the following:

  • the virtual absence of replication studies in the published literature in many scientific fields (e.g., Makel, Plucker, & Hegarty 2012),
  • widespread failure to reproduce results of published studies in large systematic replication projects (e.g., OSC 2015; Begley & Ellis 2012),
  • evidence of publication bias (Fanelli 2010a),
  • a high prevalence of “questionable research practices”, which inflate the rate of false positives in the literature (Simmons, Nelson, & Simonsohn 2011; John, Loewenstein, & Prelec 2012; Agnoli et al. 2017; Fraser et al. 2018), and
  • the documented lack of transparency and completeness in the reporting of methods, data and analysis in scientific publication (Bakker & Wicherts 2011; Nuijten et al. 2016).

The associated open science reform movement aims to rectify the conditions that led to the crisis. It does so by promoting activities such as data sharing and the public pre-registration of studies, and by advocating stricter editorial policies around statistical reporting, including the publication of replication studies and of statistically non-significant results.

This review consists of four distinct parts. First, we look at the term “reproducibility” and related terms like “repeatability” and “replication”, presenting some definitions and conceptual discussion about the epistemic function of different types of replication studies. Second, we describe the meta-science research that has established and characterised the reproducibility crisis, including large-scale replication projects and surveys of questionable research practices in various scientific communities. Third, we look at attempts to address epistemological questions about the limitations of replication, and what value it holds for scientific inquiry and the accumulation of knowledge. The fourth and final part describes some of the many initiatives the open science reform movement has proposed (and in many cases implemented) to improve reproducibility in science. There we also reflect on the values and norms which those reforms embody, noting their relevance to the debate about the role of values in the philosophy of science.

1. Replicating, Repeating, and Reproducing Scientific Results

A starting point in any philosophical exploration of reproducibility and related notions is to consider the conceptual question of what such notions mean. According to some (e.g., Cartwright 1991), the terms “replication”, “reproduction” and “repetition” denote distinct concepts, while others use these terms interchangeably (e.g., Atmanspacher & Maasen 2016a). Different disciplines can have different understandings of these terms too. In computational disciplines, for example, reproducibility often refers to the ability to reproduce computations alone, that is, it relates exclusively to sharing and sufficiently annotating data and code (e.g., Peng 2011, 2015). In those disciplines, replication describes the redoing of whole experiments (Barba 2017, Other Internet Resources). In psychology and other social and life sciences, however, reproducibility may refer to either the redoing of computations or the redoing of experiments. The Reproducibility Projects, coordinated by the Center for Open Science, redo entire studies, including both data collection and analysis. A recent funding program announcement by DARPA (the US Defense Advanced Research Projects Agency) distinguished between reproducibility and replicability, where the former refers to computational reproducibility and the latter to the redoing of experiments. Here we use all three terms—“replication”, “reproduction” and “repetition”—interchangeably, unless explicitly describing the distinctions of other authors.

When describing a study as “replicable”, people could have in mind either of at least two different things. The first is that the study is replicable in principle, in the sense that it can be carried out again, particularly when its methods, procedures and analysis are described in a sufficiently detailed and transparent way. The second is that the study is replicable in the sense that it can be carried out again and, when this happens, the replication study will successfully produce the same or sufficiently similar results as the original. A study may be replicable in the first sense but not the second: one might be able to replicate the methods, procedures and analysis of a study, but fail to successfully replicate the results of the original study. Similarly, when people talk of a “replication”, they could also have in mind two different things: the replication of the methods, procedures and analysis of a study (irrespective of the results) or, alternatively, the replication of such methods, procedures and analysis as well as of the results.

Arguably, most typologies of replication make more or less fine-grained distinctions between direct replications (which closely follow the original study to verify its results) and conceptual replications (which deliberately alter important features of the study to generalize findings or to test the underlying hypothesis in a new way). As suggested, this distinction may not always be known by these terms. For example, roughly the same distinction is referred to as exact and inexact replication by Keppel (1982); concrete and conceptual replication by Sargent (1981); and literal, operational and constructive replication by Lykken (1968). Computational reproducibility is most often direct (reproducing particular analysis outcomes from the same data set using the same code and software), but it can also be conceptual (analysing the same raw data set with alternative approaches, different models or statistical frameworks). For an example of a conceptual computational reproducibility study, see Silberzahn and Uhlmann 2015.

We do not attempt to resolve these disciplinary differences or to create a new typology of replication, and instead we will provide a limited snapshot of the conceptual terrain by surveying three existing typologies—from Stefan Schmidt (2009), from Omar Gómez, Natalia Juristo, and Sira Vegas (2010) and from Hans Radder. Schmidt’s account has been influential and widely-cited in psychology and social sciences, where the replication crisis literature is heavily concentrated. Gómez, Juristo, and Vegas’s (2010) typology of replication is based on a multidisciplinary survey of over 18 scholarly classifications of replication studies which collectively contain more than 79 types of replication. Finally, Radder’s (1996, 2003, 2006, 2009, 2012) typology is perhaps best known within philosophy of science itself.

1.1 An Account from the Social Sciences

Schmidt outlines five functions of replication studies in the social sciences:

  • Function 1. Controlling for sampling error—that is, to verify that previous results in a sample were not obtained purely by chance outcomes which paint a distorted picture of reality
  • Function 2. Controlling for artifacts (internal validity)—that is, ensuring that experimental results are a proper test of the hypothesis (i.e., have internal validity) and do not reflect unintended flaws in the study design (such as when a measurement result is, say, an artifact of a faulty thermometer rather than an actual change in a substance’s temperature)
  • Function 3. Controlling for fraud,
  • Function 4. Enabling generalizability,
  • Function 5. Enabling verification of the underlying hypothesis.

Modifying Hendrick’s (1991) classes of variables that define a research space, Schmidt (2009) presents four classes of variables which may be altered or held constant in order for a given replication study to fulfil one of the above functions. The four classes are:

  • Class 1. Information conveyed to participants (for example, their task instructions).
  • Class 2. Context and background. This is a large class of variables, and it includes: participant characteristics (e.g., age, gender, specific history); the physical setting of the research; characteristics of the experimenter; and incidental characteristics of materials (e.g., type of font, colour of the room).
  • Class 3. Participant recruitment, including selection of participants and allocation to conditions (such as experimental or control conditions).
  • Class 4. Dependent variable measures (or, in Schmidt’s terms, “procedures for the constitution of the dependent variable”, 2009: 93).

Schmidt then systematically works through examples of how each function can be achieved by altering and/or holding a different class or classes of variables constant. For example, to fulfil the function of controlling for sampling error (Function 1), one should alter only variables regarding participant recruitment (Class 3), attempting to keep variables in all other classes as close to the original study as possible. To control for artifacts (Function 2), one should alter variables concerning the context and the dependent variable measures (variables in Classes 2 and 4 respectively), but keep variables in Classes 1 and 3 (information conveyed to participants and participant recruitment) as close to the original as possible. Schmidt, like most other authors in this area, acknowledges the practical limits of being able to hold all else constant. Controlling for fraud (Function 3) is served by the same arrangements as controlling for artifacts (Function 2). In Schmidt’s account, controlling for sampling error, artifacts and fraud (Functions 1 to 3) are connected by a theme of confirming the results of the original study. Functions 4 and 5 go beyond this: generalizing to new populations (Function 4) is served by changes to participant recruitment (Class 3), and confirming the underlying hypothesis (Function 5) is served by changes to the information conveyed, the context and the dependent variable measures (Classes 1, 2 and 4 respectively) but not by changes to participant recruitment (Class 3), although Schmidt acknowledges that holding the latter class of variables constant whilst varying everything else is often practically impossible. Attempts to enable verification of the underlying research hypothesis (i.e., to fulfil Function 5) alone are what Schmidt classifies as conceptual replications, following Rosenthal (1991). Attempts to fulfil the other four functions are considered variants of direct replications.

In summary, for Schmidt, direct replications control for sampling error, artifacts, and fraud, and provide information about the reliability and validity of prior empirical work. Conceptual replications help corroborate the underlying theory or substantive (as opposed to statistical) hypothesis in question and the extent to which they generalize in new circumstances and situations. In practice, direct and conceptual replications lie on a continuum, with replication studies varying more or less compared to the original on potentially a great number of dimensions.
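
To make the mapping easier to survey, Schmidt’s correspondence between replication functions and the classes of variables a replication study alters can be written out as a simple lookup structure. The sketch below is our own illustrative encoding of the account summarised above; the labels follow Schmidt (2009), but the data structure itself is not from the source:

```python
# Illustrative encoding of Schmidt's (2009) mapping between the functions of
# replication studies and the classes of variables a replication alters,
# with all remaining classes held as close to the original study as possible.
SCHMIDT_CLASSES = {
    1: "information conveyed to participants",
    2: "context and background",
    3: "participant recruitment",
    4: "dependent variable measures",
}

SCHMIDT_FUNCTIONS = {
    "1: control for sampling error":    {"alter": {3}, "kind": "direct"},
    "2: control for artifacts":         {"alter": {2, 4}, "kind": "direct"},
    "3: control for fraud":             {"alter": {2, 4}, "kind": "direct"},
    "4: enable generalizability":       {"alter": {3}, "kind": "direct"},
    "5: verify underlying hypothesis":  {"alter": {1, 2, 4}, "kind": "conceptual"},
}

for function, spec in SCHMIDT_FUNCTIONS.items():
    altered = ", ".join(SCHMIDT_CLASSES[c] for c in sorted(spec["alter"]))
    constant = ", ".join(
        SCHMIDT_CLASSES[c] for c in sorted(set(SCHMIDT_CLASSES) - spec["alter"])
    )
    print(f"Function {function} ({spec['kind']} replication)")
    print(f"  alter:         {altered}")
    print(f"  hold constant: {constant}")
```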

1.2 An Interdisciplinary Account

Gómez, Juristo, and Vegas’s (2010) survey of the literature in 18 disciplines identified 79 types of replication, not all of which they considered entirely distinct. They identify five main ways in which a replication study may diverge from an initial study, which bear some similarities to Schmidt’s four classes above:

  • The site or spatial location of the replication experiment: replication experiments may be conducted in a location that is or is not the same as the site of the initial study.
  • The experimenters conducting a replication may be exclusively the same as the original, exclusively different, or a combination of new and original experimenters.
  • The apparatus, including the design, materials, instruments and other important experimental objects and/or procedures, may vary between original and replication studies.
  • The operationalisations employed may differ, where operationalisation refers to the measurement of variables. For example, in psychology this might include using two different scales for measuring depression (as a dependent variable).
  • Finally, studies may vary on population properties.

A change in any one or combination of these elements in a replication study corresponds to a different purpose underlying the study, and thereby establishes a different kind of validity. Like Schmidt, Gómez, Juristo, and Vegas then systematically work through how changes to each of the above elements fulfil different epistemic functions.

  • Function 1. Conclusion Validity and Controlling for Sampling Error: If each of the five elements above is unchanged in a replication study, then the purpose of the replication is to control for sampling error, that is, to verify that previous results in a sample were not obtained purely by chance outcomes which make the sample misleading or unrepresentative. This provides a safeguard against what is known as a type I error: incorrectly rejecting the null hypothesis (that is, the hypothesis that there is no relationship between the two phenomena under investigation) when it is in fact true. These studies establish conclusion validity, that is, the credibility or believability of an observed relationship or phenomenon.
  • Function 2. Internal Validity and Controlling for Artefactual Results: If a replication study differs with respect to the site, experimenters or apparatus, then its purpose is to establish that previously observed results are not an artefact of a particular apparatus, lab or so on. These studies establish internal validity, that is, the extent to which results can be attributed to the experimental manipulation itself rather than to extraneous variables.
  • Function 3. Construct Validity and Determining Limits for Operationalizations: If a replication study differs with respect to operationalisations, then its purpose is to determine the extent to which the effect generalizes across measures of manipulated or dependent variables (e.g., the extent to which the effect does not depend on the particular psychometric test one uses to evaluate depression or IQ). Such studies fulfil the function of establishing construct validity in that they provide evidence that the effect holds across different ways of measuring the constructs.
  • Function 4. External Validity and Determining Limits in the Population Properties: If a replication study differs with respect to its population properties, then its purpose is to ascertain the extent to which the results are generalizable to different populations, populations which, in Gómez, Juristo, and Vegas’s view, concern subjects and experimental objects such as programs. Such studies reinforce external validity—the extent to which the results are generalizable to different populations.

1.3 A Philosophical Account

Radder (1996, 2003, 2006, 2009, 2012) distinguishes three types of reproducibility. One is the reproducibility of what Radder calls an experiment’s material realization. Using one of Radder’s own examples as an illustration, two people may carry out the same actions to measure the mass of an object. Despite doing the same actions, person A regards themselves as measuring the object’s Newtonian mass while person B regards themselves as measuring the object’s Einsteinian mass. Here, then, the actions or material realization of the experimental procedure can be reproduced, but the theoretical descriptions of their significance differ. Radder, however, does not specify what is required for one material realization to be a reproduction of another, a pertinent question, especially since, as Radder himself affirms, no reproduction will be exactly the same as any other reproduction (1996: 82–83).

A second type of reproducibility is the reproducibility of an experiment, given a fixed theoretical description. For example, a social scientist might conduct two experiments to examine social conformity. In one experiment, a young child might be instructed to give an answer to a question before a group of other children who are, unknown to the former child, instructed to give wrong answers to the same question. In another experiment, an adult might be instructed to give an answer to a question before a group of other adults who are, unknown to the former adult, instructed to give wrong answers to the same question. If the child and the adult give a wrong answer that conforms to the answers of others, then the social scientist might interpret the result as exemplifying social conformity. For Radder, the theoretical description of the experiment might be fixed, specifying that if some people in a participant’s surroundings give intentionally false answers to the question, then the genuine participant will conform to the behaviour of their peers. However, the material realization of these experiments differs insofar as one concerns children and the other adults. It is difficult to see how, in this example at least, this differs from what either Schmidt or Gómez, Juristo, and Vegas would refer to as establishing generalizability to a different population (Schmidt’s [2009] Class 3 and Function 5; Gómez, Juristo, and Vegas’s [2010] way 5 and Function 4).

The third kind of reproducibility is what Radder calls replicability. This is where experimental procedures differ but produce the same experimental result (otherwise known as a successful replication). For example, Radder notes that multiple experiments might obtain the result “a fluid of type f has a boiling point b”, despite using different kinds of thermometers by which to measure this boiling point (2006: 113–114).

Schmidt (2009) points out that the difference between Radder’s second and third types of reproducibility is small in comparison to their differences to the first type. He consequently suggests his alternative distinction between direct and conceptual replication, presumably intending a conceptual replication to cover Radder’s second and third types.

In summary, whilst Gómez, Juristo, and Vegas’s typology draws distinctions in slightly different places to Schmidt’s, its purpose is arguably the same—to explain what types of alterations in replication studies fulfil different scientific goals, such as establishing internal validity or the extent of generalization and so on. With the exception of his discussion of reproducing the material realization, Radder’s other two categories can perhaps be seen as fitting within the larger range of functions described by Schmidt and Gómez et al., who both acknowledge that in practice, direct and conceptual replications lie on a noisy continuum.

2. Meta-Science: Establishing, Monitoring, and Evaluating the Reproducibility Crisis

In psychology, the origin of the reproducibility crisis is often linked to Daryl Bem’s (2011) paper which reported empirical evidence for the existence of “psi”, otherwise known as Extra Sensory Perception (ESP). This paper passed through the standard peer review process and was published in the high-impact Journal of Personality and Social Psychology. The controversial nature of the findings inspired three independent replication studies, each of which failed to reproduce Bem’s results. However, these replication studies were rejected from four different journals, including the journal that had originally published Bem’s study, on the grounds that the replications were not original or novel research. They were eventually published in PLoS ONE (Ritchie, Wiseman, & French 2012). This created controversy in the field, and was interpreted by many as demonstrating how publication bias impeded science’s self-correction mechanism. In medicine, the origin of the crisis is often attributed to Ioannidis’ (2005) paper “Why Most Published Research Findings Are False”. The paper offered formal arguments about inflated rates of false positives in the literature—where a “false positive” result claims a relationship exists between phenomena when it in fact does not (e.g., a claim that consuming a drug is correlated with symptom relief when it in fact is not). Ioannidis (2005) also reported very low (11%) empirical reproducibility rates from a set of pre-clinical trial replications at Amgen, later independently published by Begley and Ellis (2012). In all disciplines, the replication crisis is also more generally linked to earlier criticisms of Null Hypothesis Significance Testing (e.g., Szucs & Ioannidis 2017), which pointed out the neglect of statistical power (e.g., Cohen 1962, 1994) and a failure to adequately distinguish statistical and substantive hypotheses (e.g., Meehl 1967, 1978). This is discussed further below.

In response to the events above, a new field identifying as meta-science (or meta-research ) has become established over the last decade (Munafò et al. 2017). Munafò et al. define meta-science as “the scientific study of science itself” (2017: 1). In October 2015, Ioannidis, Fanelli, Dunne, and Goodman identified over 800 meta-science papers published in the five-month period from January to May that year, and estimated that the relevant literature was accruing at the rate of approximately 2,000 papers each year. Referring to the same bodies of work with slightly different terms, Ioannidis et al. define “meta-research” as

an evolving scientific discipline that aims to evaluate and improve research practices. It includes thematic areas of methods, reporting, reproducibility, evaluation, and incentives (how to do, report, verify, correct, and reward science). (2015: 1)

Multiple research centres dedicated to this work now exist, including, for example, the Tilburg University Meta-Research Center in psychology, the Meta-Research Innovation Center at Stanford (METRICS), and others listed in Ioannidis et al. 2015 (see Other Internet Resources). Relevant research in medical fields is also covered in Stegenga 2018.

Projects that self-identify as meta-science or meta-research include:

  • Large, crowd-sourced, direct (or close) replication projects such as The Reproducibility Projects in Psychology (OSC 2015) and Cancer Biology (Errington et al. 2014) and the Many Labs projects in psychology (e.g., Klein et al. 2014);
  • Computational reproducibility projects, that is, redoing analysis using the same original data set (e.g., Chang & Li 2015);
  • Bibliographic studies documenting the extent of publication bias in different scientific fields and changes over time (e.g., Fanelli 2010a, 2010b, 2012);
  • Surveys of the use of Questionable Research Practices (QRPs) amongst researchers and their impact on the publication literature (e.g., John, Loewenstein, & Prelec 2012; Fiedler & Schwarz 2016; Agnoli et al. 2017; Fraser et al. 2018);
  • Surveys of the completeness, correctness and transparency of methods and analysis reporting in scientific journals (e.g., Nuijten et al. 2016; Bakker & Wicherts 2011; Cumming et al. 2007; Fidler et al. 2006);
  • Survey and interview studies of researchers’ understanding of core methodological and statistical concepts, and real and perceived obstacles to improving practices (Bakker et al. 2016; Washburn et al. 2018; Allen, Dorozenko, & Roberts 2016);
  • Evaluation of incentives to change behaviour, thereby improving reproducibility and encouraging more open practices (e.g., Kidwell et al. 2016).

2.1 Reproducibility Projects

The most well known of these projects is undoubtedly the Reproducibility Project: Psychology, coordinated by what is now the Center for Open Science in Charlottesville, VA (then the Open Science Collaboration). It involved 270 crowd-sourced researchers at 64 different institutions in 11 different countries. Researchers attempted direct replications of 100 studies published in three leading psychology journals in the year 2008. Each study was replicated only once. Replications attempted to follow original protocols as closely as possible, though some differences were unavoidable (e.g., some replication studies were done with European samples when the original studies used US samples). In almost all cases, replication studies used larger sample sizes than the original studies and therefore had greater statistical power—that is, a greater probability of correctly rejecting the null hypothesis (i.e., that no relationship exists) when that hypothesis is false. A number of measures of reproducibility were reported (a toy illustration of how such measures can be computed follows the list):

  • The proportion of studies in which there was a match in statistical significance between original and replication. (Here, the statistical significance of a result is the probability that it would occur given the null hypothesis, and p values are common measures of such probabilities. A replication study and an original study would have a match in statistical significance if, for example, they both specified that the probability of the original and replication results occurring given the null hypothesis is less than 5%—i.e., if the p values for results in both studies are below 0.05.) Thirty-six percent (36%) of results were successfully reproduced according to this measure.
  • The proportion of studies in which the Effect Size (ES) of the replication study fell within the 95% Confidence Interval (CI) of the original. (Here, an ES represents the strength of a relationship between phenomena—a toy example of which is how strongly consumption of a drug is correlated with symptom relief—and a Confidence Interval provides some indication of the probability that the ES of the replication study is close to the ES of the original study.) Forty-seven percent (47%) of results were successfully reproduced according to this measure.
  • The correlation between original ES and replication ES. Replication study ESs were roughly half the size of original ESs.
  • The proportion of studies for which subjective ratings by independent researchers indicated a match between the replication and the original. Thirty-nine percent (39%) were considered successful reproductions according to this measure. The closeness of this figure to the first measure suggests that raters relied very heavily on p values in making their judgements.
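
As the toy illustration promised above, the sketch below computes the first three of these measures for a small set of invented original/replication pairs. The numbers, and the assumption of a symmetric confidence interval around each original effect size, are made up for the example; they are not OSC (2015) data:

```python
import numpy as np

# Toy data: one row per original/replication pair, with invented values.
# Columns: original ES, half-width of its 95% CI, original p, replication ES,
# replication p. These numbers are for illustration only, not OSC (2015) data.
pairs = np.array([
    [0.45, 0.20, 0.010, 0.20, 0.12],
    [0.30, 0.15, 0.030, 0.28, 0.04],
    [0.60, 0.25, 0.001, 0.35, 0.02],
    [0.25, 0.18, 0.040, 0.05, 0.60],
    [0.50, 0.22, 0.002, 0.48, 0.01],
])
es_orig, ci_half, p_orig, es_rep, p_rep = pairs.T

# Measure 1: both the original and the replication are significant at 0.05.
sig_match = np.mean((p_orig < 0.05) & (p_rep < 0.05))

# Measure 2: the replication ES falls within the original study's 95% CI
# (assumed symmetric around the original ES).
within_ci = np.mean(np.abs(es_rep - es_orig) <= ci_half)

# Measure 3: correlation between original and replication effect sizes,
# plus the ratio of their mean magnitudes.
es_corr = np.corrcoef(es_orig, es_rep)[0, 1]
es_ratio = es_rep.mean() / es_orig.mean()

print(f"significance match: {sig_match:.0%}, ES within original CI: {within_ci:.0%}")
print(f"ES correlation: {es_corr:.2f}, mean replication/original ES: {es_ratio:.2f}")
```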

There have been objections to the implementation and interpretation of this project, most notably by Gilbert et al. (2016), who took issue with the extent to which the replication studies were indeed direct replications. For example, Gilbert et al. highlighted 6 specific examples of “low fidelity protocols”, that is, cases where the replication studies differed, in their view substantially, from the original (in one case, using a European rather than a US sample of participants). However, Anderson et al. (2016) explained in a reply that in half of those cases the authors of the original study had endorsed the replication as being direct or close to the original on relevant dimensions, and furthermore that independently rated similarity between original and replication studies failed to predict replication success. Others (e.g., Etz & Vandekerckhove 2016) have applied Bayesian reanalysis to the OSC’s (2015) data and conclude that up to 75% (as opposed to the OSC’s 36–47%) of replications could be considered successful. However, they do note that in many cases this is only with very weak evidence (i.e., Bayes factors of <10). They too conclude that the failure to reproduce many effects is indeed explained by the overestimation of effect sizes, itself a product of publication bias. A Reproducibility Project: Cancer Biology (also coordinated by the Center for Open Science) is currently underway (Errington et al. 2014), originally attempting to replicate 50 of the highest impact studies in Cancer Biology published between 2010 and 2012. This project has recently announced that it will complete with only 18 replication studies, as too few originals reported enough information to proceed with full replications (Kaiser 2018). Results of the first 10 studies are reportedly mixed, with only 5 being considered “mostly repeatable” (Kaiser 2018).

The Many Labs project (Klein et al. 2014) coordinated 36 independent replications of 13 classic psychology phenomena (from 12 studies, that is, one study tested two effects), including anchoring, sunk cost bias and priming, amongst other well-known effects in psychology. In terms of matching statistical significance, the project demonstrated that 11 of the 13 effects could be successfully replicated. It also showed great variation in many of the effect sizes across the 36 replications.

In biomedical research, there have also been a number of large-scale reproducibility projects. An early one by Begley and Ellis (2012, but discussed earlier in Ioannidis 2005) attempted to replicate 56 landmark pre-clinical trials and reported an alarming reproducibility rate of only 11%, that is, only 6 of the 56 results could be successfully reproduced. Subsequent attempts at large-scale replications in this field have produced more optimistic estimates, but routinely fail to successfully reproduce more than half of the published results. Freedman et al. (2015) report five replication projects by independent groups of researchers which produced reproducibility estimates ranging from 22% to 49%. They estimate the cost of irreproducible research in US biomedical science alone to be in the order of USD$28 billion per year. A reproducibility project in Experimental Philosophy is an exception to the general trend, reporting reproducibility rates of 70% (Cova et al. forthcoming).

Finally, the Social Science Replication Project (SSRP) redid 21 experimental social science studies published in the journals Nature and Science between 2010 and 2015. Depending on the measure taken, the replication success rate was 57–67% (Camerer et al. 2018).

The causes of irreproducible results are largely the same across disciplines we have mentioned. This is not surprising given that they stem from problems with statistical methods, publishing practices and the incentive structures created in a “publish or perish” research culture, all of which are largely shared, at least in the life and behavioral sciences.

2.2 Publication Bias, Low Statistical Power and Inflated False Positive Rates

Whilst replication is often casually referred to as a cornerstone of the scientific method, direct replication studies (as they might be understood from Schmidt or Gómez, Juristo, and Vegas’s typologies above) are a rare event in the published literature of some scientific disciplines, most notably the life and social sciences. For example, such replication attempts constitute roughly 1% of the published psychology literature (Makel, Plucker, & Hegarty 2012). The proportion in published ecology and evolution literature is even smaller (Kelly 2017, Other Internet Resources).

This virtual absence of replication studies in the literature can be explained by the fact that many scientific journals have historically had explicit policies against publishing replication studies (Mahoney 1985)—thus giving rise to a “publication bias”. Over 70% of editors from 79 social science journals said they preferred new studies over replications, and over 90% said they did not encourage the submission of replication studies (Neuliep & Crandall 1990). In addition, many science funding bodies also fund only “novel”, “original” and/or “groundbreaking” research (Schmidt 2009).

A second type of publication bias has also played a substantial role in the reproducibility crisis, namely a bias towards “statistically significant” or “positive” results. Unlike the bias against replication studies, this is rarely an explicitly stated policy of a journal. Publication bias towards statistically significant findings has a long history, and was first documented in psychology by Sterling (1959). Developments in text mining techniques have led to more comprehensive estimates. For example, Fanelli’s work has demonstrated the extent of publication bias in various disciplines, and the proportions of statistically significant results given below are from his 2010a paper. He has also documented the increase of this bias over time (2012) and explored the causes of the bias, including the relationship between publication bias and a publish or perish research culture (2010b).

In many disciplines (e.g., psychology, psychiatry, materials science, pharmacology and toxicology, clinical medicine, biology and biochemistry, economics and business, microbiology and genetics) the proportion of statistically significant results is very high, close to or exceeding 90% (Fanelli 2010a). This is despite the fact that in many of these fields the average statistical power is low—that is, the average probability that a study will correctly reject a false null hypothesis is low. For example, in psychology the proportion of published results that are statistically significant is 92%, despite the fact that the average power of studies in this field to detect medium effect sizes (arguably typical of the discipline) is roughly 44% (Szucs & Ioannidis 2017). If there were no bias towards publishing statistically significant results, the proportion of significant results should roughly match the average statistical power of the discipline. The excess of statistical significance (in this case, the difference between 92% and 44%) is therefore an indicator of the strength of the bias. For a second example, in ecology and environment and in the plant and animal sciences the proportions of statistically significant results are 74% and 78% respectively, admittedly lower than in psychology. However, the most recent estimates of statistical power, again for medium effect sizes, in ecology and animal behaviour are 23–26% (Smith, Hardy, & Gammell 2011); an earlier, more optimistic assessment was 40–47% (Jennions & Møller 2003). For a third example, the proportion of statistically significant results in neuroscience and behaviour is 85%, while estimates of statistical power in neuroscience are at best 31%, with a lower bound estimate of 8% (Button et al. 2013). The associated file-drawer problem (Rosenthal 1979)—where researchers relegate “failed”, statistically non-significant studies to their file drawers, hidden from public view—has long been established in psychology and other disciplines, and is known to lead to distortions in meta-analysis (where a “meta-analysis” is a study which analyses results across multiple other studies).
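
The logic of this “excess significance” indicator can be illustrated with a small simulation. The sketch below uses made-up parameters (a true medium-sized effect, modest samples, and a strong filter against publishing non-significant results): without the publication filter, the share of significant results roughly tracks average power, while with the filter the published share climbs towards the 90% figures reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_study(d=0.5, n=30):
    """A two-group study of a genuine effect of standardized size d; returns p."""
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    return stats.ttest_ind(treatment, control).pvalue

# Many studies of a real, medium-sized effect run with modest statistical power.
p_values = np.array([one_study() for _ in range(5000)])
significant = p_values < 0.05

# Without publication bias, the share of significant results in the literature
# roughly tracks the average power (here somewhere around 45-50%).
print(f"significant among all studies: {significant.mean():.0%}")

# With a strong bias (assume only 1 in 10 non-significant studies is published),
# the published share of significant results is far higher than average power.
published = significant | (rng.random(p_values.size) < 0.10)
print(f"significant among published studies: {significant[published].mean():.0%}")
```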

2.3 Questionable Research Practices

In addition to creating the file-drawer problem described above, publication bias has been held at least partially responsible for the high prevalence of Questionable Research Practices (QRPs) uncovered both in self-report survey research (John, Loewenstein, & Prelec 2012; Agnoli et al. 2017; Fraser et al. 2018) and in journal studies that have detected, for example, unusual distributions of p values (Masicampo & Lalande 2012; Hartgerink et al. 2016). Pressure to publish, now ubiquitous across academic institutions, means that researchers often cannot afford to simply assign “failed” or statistically non-significant studies to the file drawer, so instead they p-hack and cherry-pick results (as discussed below) back to statistical significance, and back into the published literature. Simmons, Nelson, and Simonsohn (2011) explained and demonstrated with simulated results how engaging in such practices inflates the false positive error rate of the published literature, leading to a lower rate of reproducible results.
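
In the spirit of the simulations reported by Simmons, Nelson, and Simonsohn (2011), though not reproducing their code, the sketch below illustrates one such practice (collecting more data after inspecting significance, described further below) and shows how it inflates the false positive rate even when the null hypothesis is true. All parameters are invented for the illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ALPHA = 0.05

def one_false_positive(optional_stopping, n_start=20, n_step=10, n_max=60):
    """Run one study in which the null hypothesis is true (no group difference).

    With optional_stopping, significance is checked after every extra batch of
    observations and data collection stops as soon as p < ALPHA; otherwise a
    single test is run at the planned sample size.
    """
    a = list(rng.normal(0, 1, n_start))
    b = list(rng.normal(0, 1, n_start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < ALPHA:
            return True            # "significant" despite a true null hypothesis
        if not optional_stopping or len(a) >= n_max:
            return False
        a.extend(rng.normal(0, 1, n_step))   # peek, then collect more data
        b.extend(rng.normal(0, 1, n_step))

n_sims = 5000
fixed_n = np.mean([one_false_positive(False) for _ in range(n_sims)])
peeking = np.mean([one_false_positive(True) for _ in range(n_sims)])
print(f"false positive rate, single planned test: {fixed_n:.3f}")   # about 0.05
print(f"false positive rate, optional stopping:   {peeking:.3f}")   # well above 0.05
```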

“P-hacking” refers to a set of practices which include: checking the statistical significance of results before deciding whether to collect more data; stopping data collection early because results have reached statistical significance; deciding whether to exclude data points (e.g., outliers) only after checking the impact on statistical significance and not reporting the impact of the data exclusion; adjusting statistical models, for instance by including or excluding covariates based on the resulting strength of the main effect of interest; and rounding off a p value to meet a statistical significance threshold (e.g., presenting 0.053 as p < .05). “Cherry-picking” includes failing to report dependent or response variables or relationships that did not reach statistical significance or other threshold and/or failing to report conditions or treatments that did not reach statistical significance or other threshold. “HARKing” (Hypothesising After Results are Known) includes presenting ad hoc and/or unexpected findings as though they had been predicted all along (Kerr 1998); and presenting exploratory work as though it was confirmatory hypothesis testing (Wagenmakers et al. 2012). Five of the most widespread QRPs are listed below in Table 1 (from Fraser et al. 2018), with associated survey measures of prevalence.

Table 1: The prevalence of some common Questionable Research Practices. Percentage of researchers who reported having used the QRP at least once (adapted from Fraser et al. 2018).

| Questionable Research Practice | Agnoli et al. 2017 | John, Loewenstein, & Prelec 2012 | Fraser et al. 2018 | Fraser et al. 2018 |
| --- | --- | --- | --- | --- |
| Not reporting response (outcome) variables that failed to reach statistical significance # | 47.9 | 63.4 | 64.1 | 63.7 |
| Collecting more data after inspecting whether the results are statistically significant * | 53.2 | 55.9 | 36.9 | 50.7 |
| Rounding-off a value or other quantity to meet a pre-specified threshold * | 22.2 | 22.0 | 27.3 | 17.5 |
| Deciding to exclude data points after first checking the impact on statistical significance * | 39.7 | 38.2 | 24.0 | 23.9 |
| Reporting an unexpected finding as having been predicted from the start ^ | 37.4 | 27.0 | 48.5 | 54.2 |

# cherry picking, * p hacking, ^ HARKing

2.4 Over-Reliance on Null Hypothesis Significance Testing

Null Hypothesis Significance Testing (NHST)—discussed above—is a commonly diagnosed cause of the current replication crisis (see Szucs & Ioannidis 2017). The ubiquitous nature of NHST in the life and behavioural sciences is well documented, most recently by Cristea and Ioannidis (2018). This is an important pre-condition for establishing its role as a cause, since it could not be a cause if its actual use were rare. The dichotomous nature of NHST facilitates publication bias (Meehl 1967, 1978). For example, the language of accept and reject in hypothesis testing maps conveniently onto the acceptance and rejection of manuscripts, a fact that led Rosnow and Rosenthal (1989) to decry that “surely God loves the .06 nearly as much as the .05” (1989: 1277). Techniques that do not enshrine a dichotomous threshold would be harder to employ in the service of publication bias. For example, a case has been made that estimation using effect sizes and confidence intervals (introduced above) would be less prone to being used in the service of publication bias (Cumming 2012; Cumming & Calin-Jageman 2017).

As already mentioned, the average statistical power in various disciplines is low. Not only is power often low, but it is virtually never reported; less than 10% of published studies in psychology report statistical power and even fewer in ecology do (Fidler et al. 2006). Explanations for the widespread neglect of statistical power often highlight the many common misconceptions and fallacies associated with p values (e.g., Haller & Krauss 2002; Gigerenzer 2018). For example, the inverse probability fallacy[1] has been used to explain why so many researchers fail to calculate and report statistical power (Oakes 1986).

In 2017, a group of 72 authors proposed in a Nature Human Behaviour paper that the alpha level in statistical significance testing be lowered to 0.005 (as opposed to the current standard of 0.05) to improve the reproducibility rate of published research (Benjamin et al. 2018). A reply from a different set of 88 authors was published in the same journal, arguing against this proposal and stating instead that researchers should justify their alpha level based on context (Lakens et al. 2018). Several other replies have followed, including a call from Andrew Gelman and colleagues to abandon statistical significance altogether (McShane et al. 2018, Other Internet Resources). The exchange has become known on social media as the Alpha Wars (e.g., in the Barely Significant blog, Other Internet Resources). Independently, the American Statistical Association released a statement on the use of p values for the first time in its history, cautioning against their overinterpretation and pointing out the limits of the information they offer about replication (Wasserstein & Lazar 2016), and devoted its 2017 annual convention to the theme “Scientific Method for the 21st Century: A World Beyond \(p < 0.05\)” (see Other Internet Resources).
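
One practical consideration raised in this debate is the cost of a stricter threshold: holding power and effect size fixed, a lower alpha requires larger samples. The rough normal-approximation sketch below is our own illustration, assuming a two-sided, two-sample comparison of a medium effect (d = 0.5) at 80% power:

```python
from scipy.stats import norm

def n_per_group(alpha, power=0.80, d=0.5):
    """Approximate per-group n for a two-sided, two-sample comparison of a
    standardized effect d, using the usual normal-approximation formula."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

n_05 = n_per_group(alpha=0.05)     # roughly 63 participants per group
n_005 = n_per_group(alpha=0.005)   # roughly 107 participants per group
print(f"alpha = 0.05:  about {n_05:.0f} per group")
print(f"alpha = 0.005: about {n_005:.0f} per group")
print(f"required increase: about {100 * (n_005 / n_05 - 1):.0f}%")
```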

2.5 Scientific Fraud

A number of recent high-profile cases of scientific fraud have contributed considerably to the amount of press around the reproducibility crisis in science. Often these cases (e.g., Diederik Stapel in psychology) are used as a hook for media coverage, even though the crisis itself has very little to do with scientific fraud. (Note also that the Questionable Research Practices above are not typically counted as “fraud” or even “scientific misconduct”, despite their ethically dubious status.) For example, Fang, Grant Steen, and Casadevall (2012) estimated that 43% of retracted articles in biomedical research are withdrawn because of fraud. However, roughly half a million biomedical articles are published annually and only 400 of those are retracted (Oransky 2016, founder of the website RetractionWatch), so this amounts to a very small proportion of the literature (approximately 0.1%). There are, of course, many cases of pharmaceutical companies exercising financial pressure on scientists and the publishing industry, which raise speculation about how many undetected (or unretracted) cases there may still be in the literature. Having said that, there is widespread consensus amongst scientists in the field that the main cause of the current reproducibility crisis is the current incentive structure in science (publication bias, publish or perish, non-transparent statistical reporting, lack of rewards for data sharing). Whilst this incentive structure can push some to scientific fraud, it appears to do so in only a very small proportion of cases.

3. Epistemological Issues Related to Replication

Many scientists believe that replication is epistemically valuable in some way, that is to say, that replication serves a useful function in enhancing our knowledge, understanding or beliefs about reality. This section first discusses a problem about the epistemic value of replication studies—called the “experimenters’ regress”—and it then considers the claim that replication plays an epistemically valuable role in distinguishing scientific inquiry. It lastly examines a recent attempt to formalise the logic of replication in a Bayesian framework.

3.1 The Experimenters’ Regress

Collins (1985) articulated a widely discussed problem that is now known as the experimenters’ regress. He initially lays out the problem in the context of measurement (Collins 1985: 84). Suppose a scientist is trying to determine the accuracy of a measurement device and also the accuracy of a measurement result. Perhaps, for example, a scientist is using a thermometer to measure the temperature of a liquid, and it delivers a particular measurement result, say, 12 degrees Celsius.

The problem arises because of the interdependence of the accuracy of the measurement result and the accuracy of the measurement device: to know whether a particular measurement result is accurate, we need to test it against a measurement result that is previously known to be accurate, but to know that the result is accurate, we need to know that it has been obtained via an accurate measuring device, and so on. This, according to Collins, creates a “circle” which he refers to as the “experimenters’ regress”.

Collins extends the problem to scientific replication more generally. Suppose that an experiment B is a replication study of an initial experiment A, and that B’s result apparently conflicts with A’s result. This seeming conflict may have one of two interpretations:

  • The results of A and B deliver genuinely conflicting verdicts over the truth of the hypothesis under investigation.
  • Experiment B was not in fact a proper replication of experiment A.

The regress poses a problem about how to choose between these interpretations, a problem which threatens the epistemic value of replication studies if there are no rational grounds for choosing in a particular way. Determining whether one experiment is a proper replication of another is complicated by the facts that scientific writing conventions often omit precise details of experimental methodology (Collins 2016), and, furthermore, much of the knowledge that scientists require to execute experiments is tacit and “cannot be fully explicated or absolutely established” (Collins 1985: 73).

In the context of experimental methodology, Collins wrote:

To know an experiment has been well conducted, one needs to know whether it gives rise to the correct outcome. But to know what the correct outcome is, one needs to do a well-conducted experiment. But to know whether the experiment has been well conducted…! (2016: 66; ellipses original)

Collins holds that in such cases where a conflict of results arises, scientists tend to split into two groups, each holding opposing interpretations of the results. According to Collins, where such groups are “determined” and the “controversy runs deep” (Collins 2016: 67), the dispute between the groups cannot be resolved via further experimentation, for each additional result is subject to the problem posed by the experimenters’ regress.[2] In such cases, Collins claims that particular non-epistemic factors will partly determine which interpretation becomes the lasting view:

the career, social, and cognitive interests of the scientists, their reputations and that of their institutions, and the perceived utility for future work. (Franklin & Collins 2016: 99)

Franklin was the most vociferous opponent of Collins, although recent collaboration between the two has fostered some agreement (Collins 2016). Franklin presented a set of strategies for validating experimental results, all of which relate to “rational argument” on epistemic grounds (Franklin 1989: 459; 1994). Examples include, for instance, appealing to experimental checks on measurement devices or eliminating potential sources of error in the experiment (Franklin & Collins 2016). He claimed that the fact that such strategies were evidenced in scientific practice “argues against those who believe that rational arguments plays little, if any, role” in such validation (Franklin 1989: 459), with Collins being an example. He interprets Collins as suggesting that the strategies for resolving debates about the validation of results are social factors or “culturally accepted practices” (Franklin 1989: 459) which do not provide reasons to underpin rational belief about results. Franklin (1994) further claims that Collins conflates the difficulty of successfully executing experiments with the difficulty of demonstrating that experiments have been executed, with Feest (2016) interpreting him to say that although such execution requires tacit knowledge, one can nevertheless appeal to strategies to demonstrate the validity of experimental findings.

Feest (2016) examines a case study involving debates about the Mozart effect in psychology (which, roughly speaking, is the effect whereby listening to Mozart beneficially affects some aspect of intelligence or brain structure). Like Collins, she agrees that there is a problem in determining whether conflicting results suggest a putative replication experiment is not a proper replication attempt, in part because there is uncertainty about whether scientific concepts such as the Mozart effect have been appropriately operationalised in earlier or later experimental contexts. Unlike Collins (on her interpretation), however, she does not think that this uncertainty arises because scientists have inescapably tacit knowledge of the linguistic rules about the meaning and application of concepts like the Mozart effect. Rather, the uncertainty arises because such concepts are themselves still developing, and because of the assumptions about the world that are required to successfully draw inferences from experiments. Experimental methodology then serves to reveal the previously tacit assumptions about the application of concepts and the legitimacy of inferences, assumptions which are then susceptible to scrutiny.

For example, in her study of the Mozart effect, she notes that replication studies of the Mozart effect failed to find that Mozart music had a beneficial influence on spatial abilities. Rauscher, who was the first to report results supporting the Mozart effect, suggested that the later studies were not proper replications of her study (Rauscher, Shaw, and Ky 1993, 1995). She clarified that the Mozart effect applied only to a particular category of spatial abilities (spatio-temporal processes) and that the later studies operationalised the Mozart effect in terms of different spatial abilities (spatial recognition). Here, then, there was a difficulty in determining whether to interpret failed replication results as evidence against the initial results or rather as an indication that the replication studies were not proper replications. Feest claims this difficulty arose because of tacit knowledge or assumptions: assumptions about the application of the Mozart effect concept to different kinds of spatial abilities, about whether the world is such that Mozart music has an effect on such abilities and about whether the failure of Mozart to impact other kinds of spatial abilities warrants the inference that the Mozart effect does not exist. Contra Collins, however, experimental methodology enabled the explication and testing of these assumptions, thus allowing scientists to overcome the interpretive impasse.

Against this background, her overall argument is that scientists often are and should be sceptical towards each other’s results. However, this is not because of inescapably tacit knowledge and the inevitable failure of epistemic strategies for validating results. Rather, it is at least in part because of varying tacit assumptions that researchers have about the meaning of concepts, about the world and about what to draw inferences from it. Progressive experimentation serves to reveal these tacit assumptions which can then be scrutinised, leading to the accumulation of knowledge.

There is also other philosophical literature on the experimenters’ regress, including Teira’s (2013) paper arguing that particular experimental debiasing procedures are defensible against the regress from a contractualist perspective, according to which self-interested scientists have reason to adopt good methodological standards.

3.2 Replication as a Distinguishing Feature of Science

There is a widespread belief that science is distinct from other knowledge accumulation endeavours, and some have suggested that replication distinguishes (or is at least essential to) science in this respect (see also the entry on science and pseudo-science). According to the Open Science Collaboration, “Reproducible research practices are at the heart of sound research and integral to the scientific method” (OSC 2015: 7). Schmidt echoes this theme: “To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception” (2009: 90). Braude (1979) goes so far as to say that reproducibility is a “demarcation criterion between science and nonscience” (1979: 2). Similarly, Nosek, Spies, and Motyl state that:

[T]he scientific method differentiates itself from other approaches by publicly disclosing the basis of evidence for a claim…. In principle, open sharing of methodology means that the entire body of scientific knowledge can be reproduced by anyone. (2012: 618)

If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) considers the extent to which it is such a theme. He presents a variety of cases from the history of science where replication played very different roles, although he understands “replication” narrowly to refer to cases in which an experiment is re-run by different researchers. He claims that the role and value of replication in experimental practice is “much more complex than easy textbook accounts make us believe” (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication, and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of the acceptance of research claims in some cases, but not in others.

For example, sometimes replication was sufficient to embrace a research claim, even if it conflicted with the background of accepted theory and left theoretical questions unresolved. A case of this is high-temperature superconductivity, the effect whereby an electric current can pass with zero resistance through a conductor at relatively high temperatures. In 1986, physicists Georg Bednorz and Alex Müller reported finding a material which acted as a superconductor at 35 kelvin (−238 degrees Celsius). Scientists around the world successfully replicated the effect, and Bednorz and Müller were awarded the Nobel Prize in Physics a year after their announcement. This case is remarkable since not only did their effect contradict the accepted physical theory at the time, but there is still no extant theory that adequately explains the effects which they reported (Di Bucchianico 2014).

As a contrasting example, however, sometimes claims were accepted without any replication. In the 1650s, German scientist Otto von Guericke designed and operated the world’s first vacuum pump that would visibly suck air out of a larger space. He performed experiments with his device before various audiences. Yet the replication of his experiments by others would have been very difficult, if not impossible: not only was Guericke’s pump both expensive and complicated to build, but it was also unlikely that his descriptions of it sufficed to enable anyone else to build the pump and consequently replicate his findings. Despite this, Steinle claims that “no doubts were raised about his results”, probably as a result of his “public performances that could be witnessed by a large number of participants” (2016: 55).

Steinle takes such historical cases to provide normative guidance for understanding the epistemic value of replication as context-sensitive: whether replication is necessary or sufficient for establishing a research claim will depend on a variety of considerations, such as those mentioned earlier. He consequently eschews wide-reaching claims, such as those that “it’s all about replicability” or that “replicability does not decide anything” (2016: 60).

3.3 Formalising the Logic of Replication

Earp and Trafimow (2015) attempt to formalise the way in which replication is epistemically valuable, and they do this using a Bayesian framework to explicate the inferences drawn from replication studies. They present the framework in a context similar to that of Collins (1985), noting that “it is well-nigh impossible to say conclusively what [replication results] mean” (Earp & Trafimow 2015: 3). But while replication studies are often not conclusive, they do believe that such studies can be informative, and their Bayesian framework depicts how this is so.

The framework is set out with an example. Suppose an aficionado of Researcher A is highly confident that anything said by Researcher A is true. Some other researcher, Researcher B, then attempts to replicate an experiment by Researcher A, and Researcher B finds results that conflict with those of Researcher A. Earp and Trafimow claim that the aficionado might continue to be confident in Researcher A’s findings, but the aficionado’s confidence is likely to slightly decrease. As the number of failed replication attempts increases, the aficionado’s confidence accordingly decreases, eventually falling below 50% and thereby placing more confidence in the replication failures than in the findings initially reported by Researcher A.

Here, then, suppose we are interested in the probability that the original result reported by Researcher A is true given Researcher B’s first replication failure. Earp and Trafimow represent this probability with the notation \(p(T\mid F)\), where \(p\) is a probability function, \(T\) represents the proposition that the original result is true and \(F\) represents Researcher B’s replication failure. According to Bayes’ theorem below, this probability is calculable from the aficionado’s degree of confidence that the original result is true prior to learning of the replication failure, \(p(T)\); their degree of expectation of the replication failure on the condition that the original result is true, \(p(F\mid T)\); and the degree to which they would unconditionally expect a replication failure prior to learning of it, \(p(F)\):

\[p(T\mid F) = \frac{p(F\mid T)\,p(T)}{p(F)} \tag{1}\]
Relatedly, we could instead be interested in the confidence ratio that the original result is true or false given the failure to replicate. This ratio is representable as \(\frac{p(T\mid F)}{p(\neg T\mid F)}\) where \(\neg T\) represents the proposition that the original result is false. According to the standard Bayesian probability calculus, this ratio in turn is related to a product of ratios concerning

  • the confidence that the original result is true \(\frac{p(T)}{p(\neg T)}\) and
  • the expectation of a replication failure on the condition that the result is true or false \(\frac{p(F\mid T)}{p(F\mid \neg T)}\).

This relation is expressed in the equation:

\[\frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} \tag{2}\]
Now Earp and Trafimow assign some values to the terms on the right-hand side of equation (2). Supposing that the aficionado is confident in the original results, they set the ratio \(\frac{p(T)}{p(\neg T)}\) to 50, meaning that the aficionado is initially fifty times more confident that the results are true than that the results are false.

They also set the ratio \(\frac{p(F\mid T)}{p(F\mid \neg T)}\) about the conditional expectation of a replication failure to 0.5, meaning that the aficionado is considerably less confident that there will be a replication failure if the original result is true than if it is false. They point out that the extent to which the aficionado is less confident depends on the quality of so-called auxiliary assumptions about the replication experiment. Here, auxiliary assumptions are assumptions which enable one to infer that particular things should be observable if the theory under test is true. The intuitive idea is that the higher the quality of the assumptions about a replication study, the more one would expect to observe a successful replication if the original result were true. While they do not specify precisely what makes such auxiliary assumptions have high “quality” in this context, presumably this quality concerns the extent to which the assumptions are likely to be true and the extent to which the replication experiment is an appropriate test of the veracity of the original results if the assumptions are true.

Once the ratios on the right-hand side of equation (2) are set in this way, one can see that a replication failure would reduce one’s confidence in the original results:

\[\frac{p(T\mid F)}{p(\neg T\mid F)} = \frac{p(T)}{p(\neg T)} \times \frac{p(F\mid T)}{p(F\mid \neg T)} = 50 \times 0.5 = 25\]
Here, then, a replication failure would reduce the aficionado’s confidence that the original result was true so that the aficionado would be only 25 times more confident that the result is true given a failure (as per \(\frac{p(T\mid F)}{p(\neg T\mid F)}\)) rather than 50 times more confident that it is true (as per \(\frac{p(T)}{p(\neg T)}\)).
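To put the same point in terms of probabilities rather than odds (an illustrative conversion, not part of Earp and Trafimow’s own presentation), odds of 25 and 50 correspond to:

\[p(T\mid F) = \frac{25}{25+1} \approx 0.96 \quad\text{as against}\quad p(T) = \frac{50}{50+1} \approx 0.98\]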

Nevertheless, the aficionado may still be confident that the original result is true, but we can see how such confidence would decrease with successive replication failures. More formally, let \(F_N\) be the last replication failure in a sequence of \(N\) replication failures \(\langle F_1,F_2,\ldots,F_N\rangle\). Then, the aficionado’s confidence in the original result given the \(N\) replication failures is expressible in the equation: [ 3 ]

\[\frac{p(T\mid F_1, F_2, \ldots, F_N)}{p(\neg T\mid F_1, F_2, \ldots, F_N)} = \frac{p(T)}{p(\neg T)} \times \prod_{i=1}^{N} \frac{p(F_i\mid T)}{p(F_i\mid \neg T)} \tag{3}\]
For example, suppose there are 10 replication failures, and so \(N=10\). Suppose further that the confidence ratio for each replication failure is again set to 0.5, so that:

\[\frac{p(T\mid F_1, \ldots, F_{10})}{p(\neg T\mid F_1, \ldots, F_{10})} = 50 \times (0.5)^{10} \approx 0.05\]
Here, then, the aficionado’s confidence in the original result decreases so that they are more confident that it was false than that it was true. Hence, on Earp and Trafimow’s Bayesian account, successive replication failures can progressively erode one’s confidence that an original result is true, even if one was initially highly confident in the original result and even if no single replication failure by itself was conclusive. [ 4 ]
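To make the arithmetic concrete, the sequential updating just described can be computed directly. The following short program is only an illustrative sketch, not part of Earp and Trafimow’s presentation; it assumes, as in the example above, prior odds of 50 in favour of the original result and a constant likelihood ratio of 0.5 for each replication failure, with the failures treated as conditionally independent:

    # Illustrative sketch of the Bayesian updating described above.
    # prior_odds: the aficionado's initial confidence ratio p(T)/p(not-T).
    # likelihood_ratio: p(F_i | T) / p(F_i | not-T) for each replication failure,
    # assumed constant and conditionally independent given T (and given not-T).

    def posterior_odds(prior_odds, likelihood_ratio, n_failures):
        """Odds that the original result is true after n_failures failed replications."""
        return prior_odds * likelihood_ratio ** n_failures

    def to_probability(odds):
        """Convert odds p/(1 - p) back into the probability p."""
        return odds / (1 + odds)

    for n in (0, 1, 5, 10):
        odds = posterior_odds(50, 0.5, n)
        print(f"after {n:2d} failed replications: odds = {odds:7.3f}, "
              f"p(T | failures) = {to_probability(odds):.3f}")

On these illustrative numbers, the odds in favour of the original result fall below 1 (that is, the probability falls below 0.5) after the sixth failed replication, matching the qualitative pattern described above.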

Some putative merits of Earp and Trafimow’s account, then, are that it formalises how replication attempts can be informative even when they are not conclusive, and that the formalisation gives a role both to the quantity of replication attempts and to the auxiliary assumptions about the replications.

4. Open Science Reforms: Values, Tone, and Scientific Norms

The aforementioned meta-science has unearthed a range of problems which give rise to the reproducibility crisis, and the open science movement has proposed or promoted various solutions—or reforms—for these problems. These reforms can be grouped into four categories: (a) methods and training, (b) reporting and dissemination, (c) peer review, and (d) incentives and evaluations (loosely following the categories used by Munafò et al. 2017 and Ioannidis et al. 2015). In subsections 4.1–4.4 below, we present a non-exhaustive list of initiatives in each of the above categories. These initiatives reflect various values and norms that are at the heart of the open science movement, and we discuss these values and norms in section 4.5.

  • Combating bias. The development of methods for combating bias, for example, masked or blind analysis techniques to combat confirmation bias (e.g., MacCoun & Perlmutter 2017).
  • Support. Providing methodological support for researchers, including published guidelines and statistical consultancy (for example, as offered by the Center for Open Science) and large online courses such as that developed by Daniël Lakens (see Other Internet Resources).
  • Collaboration. Promoting collaboration and team/crowd-sourced science to combat low power and other methodological limitations of single studies. The Reproducibility Projects themselves are an example of this, but there are other initiatives too, such as StudySwap in psychology and the Collaborative Replications and Education Project (CREP), which aims to increase the prevalence of replications through undergraduate education (see Other Internet Resources for both of these, and Munafò et al. 2017 for a more detailed description).
  • The TOP Guidelines. The Transparency and Openness Promotion (TOP) guidelines (Nosek et al. 2015) have, as of the end of May 2018, almost 5,000 journals and organizations as signatories. Developed within psychology, the TOP guidelines have formed the basis of other discipline-specific guidelines, such as the Tools for Transparency in Ecology and Evolution (TTEE). As the name suggests, these guidelines promote more complete and transparent reporting of methodological and statistical practices. This in turn enables authors, reviewers and editors to consider detailed aspects of sample size planning and design decisions, and to clearly distinguish between confirmatory (planned) analysis and exploratory (post hoc) analysis.
  • Pre-registration. In its simplest form, pre-registration involves making a public, date-stamped statement of predictions and/or hypotheses before data is collected, viewed or analysed. The purpose is to distinguish prediction from postdiction (Nosek et al. 2018), or what is elsewhere referred to as confirmatory versus exploratory research (Wagenmakers et al. 2012), a distinction perhaps more commonly known as hypothesis-testing versus hypothesis-generating research. Pre-registration of predictive research helps control for HARKing (Kerr 1998) and hindsight bias, and, within the frequentist Null Hypothesis Significance Testing framework, helps contain the false positive error rate at the set alpha level. Several platforms host pre-registrations, such as the Open Science Framework (osf.io) and As Predicted (aspredicted.org). The Open Science Framework also hosts a “pre-registration challenge” offering monetary rewards for publishing pre-registered work.
  • Specific Journal Initiatives. Some high impact journals, having been singled out in the science media as having particularly problematic publishing practices (e.g., Schekman 2013), have taken exceptional steps to improve the completeness, transparency and reproducibility of the research they publish. For example, since 2013, Nature and the Nature research journals have engaged in a range of editorial activities aimed at improving the reproducibility of the research published in their journals (see the editorial announcement, Nature 496, 398, 25 April 2013, doi:10.1038/496398a). In 2017, they introduced checklists and reporting summaries (published alongside articles) in an effort to improve transparency and reproducibility. In 2018, they produced discipline-specific versions for Nature Human Behaviour and Nature Ecology & Evolution. Within psychology, the journal Psychological Science (the flagship journal of the Association for Psychological Science) was the first to adopt open science practices, such as the COS Open Science badges described below. Following a meeting of ecology and evolution journal editors in 2015, a number of journals in these fields have run editorials on this topic, often committing to the TTEE guidelines (discussed above). Conservation Biology has in addition adopted a checklist for associate editors (Parker et al. 2016).
  • Registered reports. Registered reports shift the point at which peer review occurs in the research process, in an effort to combat publication bias against null (negative) results. Manuscripts are submitted, reviewed and a publication decision made on the basis of the introduction, methods and planned analysis alone. If accepted, authors then have a defined period of time in which to carry out the planned research and submit the results. Assuming the authors followed their original plans (or adequately justified deviations from them), the journal will honour its decision to publish, regardless of the results. In psychology, the Registered Report format has been championed by Chris Chambers, with the journal Cortex being the first to adopt the format under Chambers’ editorship (Chambers 2013, 2017; Nosek & Lakens 2014). Currently (end of May 2018), 108 journals in a range of biomedical, psychology and neuroscience fields offer the format (see Registered Reports in Other Internet Resources).
  • Pre-prints. Well-established in some sciences like physics, the use of pre-print servers is relatively new in biological and social sciences.
  • Open Science badges. A recent review of initiatives for improving data sharing identified the awarding of open data and open materials badges as the most effective scheme (Rowhani-Farid, Allen, & Barnett 2017). One such badge scheme is coordinated by the Center for Open Science, which currently awards three badges: Open Data, Open Materials and Pre-Registration. Badges are attached to articles whose authors have met a specific set of criteria for engaging in these activities. Kidwell et al. (2016) evaluated the effectiveness of badges in the journal Psychological Science and found substantial increases (from 3% to 39%) in data sharing over a period of less than two years. Such increases were not found in similar journals without badge schemes over the same period.

There has long been philosophical debate about what role values do and should play in science (Churchman 1948; Rudner 1953; Douglas 2016), and the reproducibility crisis is intimately connected to questions about the operations of, and interconnections between, such values. In particular, Nosek, Spies, and Motyl (2012) argue that there is a tension between truth and publishability. More specifically, for reasons discussed in section 2 above, the accuracy of scientific results is compromised by the value which journals place on novel and positive results and, consequently, by scientists who, valuing career success, seek to publish such results exclusively in these journals. Many others besides Nosek and colleagues (Hackett 2005; Martin 1992; Sovacool 2008) have also taken issue with the value which journals and funding bodies place on novelty.

Some might interpret the tension as a manifestation of how epistemic values (such as truth and replicability) can be compromised by (arguably) non-epistemic values, such as the value of novel, interesting or surprising results. Epistemic values are typically taken to be values that, in the words of Steel, “promote the acquisition of true beliefs” (2010: 18; see also Goldman 1999). Canonical examples of epistemic values include the predictive accuracy and internal consistency of a theory. Epistemic values are often contrasted with putative non-epistemic or non-cognitive values, which include ethical or social values such as the novelty of a theory or its ability to improve well-being by lessening power inequalities (Longino 1996). Of course, there is no complete consensus as to precisely what counts as an epistemic or non-epistemic value (Rooney 1992; Longino 1996). Longino, for example, claims that, other things being equal, novelty counts in favour of accepting a theory, and convincingly argues that, in some contexts, it can serve as a “protection against unconscious perpetuation of the sexism and androcentrism” in traditional science (1997: 22). However, she does not discuss novelty specifically in the context of the reproducibility crisis.

Giner-Sorolla (2012), however, does discuss novelty in the context of the crisis, and he offers another perspective on its value. He claims that one reason novelty has been used to define what is publishable or fundable is that it is relatively easy for researchers to establish and for reviewers and editors to detect. Yet, Giner-Sorolla argues, novelty should perhaps not be valued for its own sake, and should instead be recognized as merely an operationalisation of a deeper concept, such as the “ability to advance the field” (2012: 567). He goes on to point out how such shallow operationalisations of important concepts often lead to problems, for example, using statistical significance to measure the importance of results, or measuring the quality of research by how well outcomes fit with experimenters’ prior expectations.

Values are closely connected to discussions about norms in the open science movement. Vazire (2018) and others invoke norms of science—communality, universalism, disinterestedness and organised skepticism—in setting the goals for open science, norms originally articulated by Robert Merton (1942). Each such norm arguably reflects a value which Merton advocated, and each norm may be opposed by a counternorm which denotes behaviour in conflict with the norm. For example, the norm of communality (which Merton called “communism”) reflects the value of collaboration and the common ownership of scientific goods, since the norm recommends such collaboration and common ownership. Advocates of open science see such norms, and the values which they reflect, as aims for open science. For example, the norm of communality is reflected in sharing and making data open, and in open access publishing. In contrast, the counternorm of secrecy is associated with a closed, for-profit publishing system (Anderson et al. 2010). Likewise, assessing scientific work on its merits upholds the norm of universalism—that the evaluation of research claims should not depend on the socio-demographic characteristics of their proponents. In contrast, assessing work by the age, status or institution of its authors, or by the metrics of the journal it is published in, reflects a counternorm of particularism.

Vazire (2018) and others have argued that, at the moment, scientific practice is dominated by counternorms and that a move to Mertonian norms is a goal of the open science reform movement. In particular, self-interestedness, as opposed to the norm of disinterestedness, motivates p-hacking and other questionable research practices. Similarly, a desire to protect one’s professional reputation motivates resistance to having one’s work replicated by others (Vazire 2018). This in turn reinforces a counternorm of organized dogmatism rather than organized skepticism, which, according to Merton, involves the “temporary suspension of judgment and the detached scrutiny of beliefs” (Merton 1942 [1973]).

Anderson et al.’s (2010) focus groups and surveys of scientists suggest that scientists do want to adhere to Merton’s norms but that the current incentive structure of science makes this difficult. Changing the structure of penalty and reward systems within science to promote communality, universalism, disinterestedness and organized skepticism instead of their counternorms is an ongoing challenge for the open science reform movement. As Pashler and Wagenmakers (2012) have said:

replicability problems will not be so easily overcome, as they reflect deep-seated human biases and well-entrenched incentives that shape the behavior of individuals and institutions. (2012: 529)

The effort to promote such values and norms has generated heated controversy. Some early responses to the Reproducibility Project: Psychology and the Many Labs projects were highly critical, not just of the substance of the work but also of its nature and process. Calls for openness were interpreted as reflecting mistrust, and attempts to replicate others’ work as personal attacks (e.g., Schnall 2014 in Other Internet Resources). Nosek, Spies, & Motyl (2012) argue that calls for openness should not be interpreted as mistrust:

Opening our research process will make us feel accountable to do our best to get it right; and, if we do not get it right, to increase the opportunities for others to detect the problems and correct them. Openness is not needed because we are untrustworthy; it is needed because we are human. (2012: 626)

Exchanges related to this have become known as the tone debate.

5. Conclusion

The subject of reproducibility is associated with a turbulent period in contemporary science. This period has prompted a re-evaluation of the values, incentives, practices and structures which underpin scientific inquiry. While the meta-science has painted a bleak picture of reproducibility in some fields, it has also inspired a parallel movement to strengthen the foundations of science. However, there is more progress to be made, especially in understanding and evaluating solutions to the reproducibility crisis. In this regard, there are fruitful avenues for future research, including a deeper exploration of the role that epistemic and non-epistemic values can or should play in scientific inquiry.

Bibliography

  • Agnoli, Franca, Jelte M. Wicherts, Coosje L. S. Veldkamp, Paolo Albiero, and Roberto Cubelli, 2017, “Questionable Research Practices among Italian Research Psychologists”, Jakob Pietschnig (ed.), PLoS ONE , 12(3): e0172792. doi:10.1371/journal.pone.0172792
  • Allen, Peter J., Kate P. Dorozenko, and Lynne D. Roberts, 2016, “Difficult Decisions: A Qualitative Exploration of the Statistical Decision Making Process from the Perspectives of Psychology Students and Academics”, Frontiers in Psychology , 7(February): 188. doi:10.3389/fpsyg.2016.00188
  • Anderson, Christopher J., Štěpán Bahnik, Michael Barnett-Cowan, Frank A. Bosco, Jesse Chandler, C. R. Chartier, F. Cheung, et al., 2016, “Response to Comment on ‘Estimating the Reproducibility of Psychological Science’”, Science , 351(6277): 1037. doi:10.1126/science.aad9163
  • Anderson, Melissa S., Emily A. Ronning, Raymond De Vries, and Brian C. Martinson, 2010, “Extending the Mertonian Norms: Scientists’ Subscription to Norms of Research”, The Journal of Higher Education , 81(3): 366–393. doi:10.1353/jhe.0.0095
  • Atmanspacher, Harald and Sabine Maasen, 2016a, “Introduction”, in Atmanspacher and Maasen 2016b: 1–8. doi:10.1002/9781118865064.ch0
  • ––– (eds.), 2016b, Reproducibility: Principles, Problems, Practices, and Prospects , Hoboken, NJ: John Wiley & Sons. doi:10.1002/9781118865064
  • Baker, Monya, 2016, “1,500 Scientists Lift the Lid on Reproducibility”, Nature , 533(7604): 452–454. doi:10.1038/533452a
  • Bakker, Marjan, Chris H. J. Hartgerink, Jelte M. Wicherts, and Han L. J. van der Maas, 2016, “Researchers’ Intuitions About Power in Psychological Research”, Psychological Science , 27(8): 1069–1077. doi:10.1177/0956797616647519
  • Bakker, Marjan and Jelte M. Wicherts, 2011, “The (Mis)Reporting of Statistical Results in Psychology Journals”, Behavior Research Methods , 43(3): 666–678. doi:10.3758/s13428-011-0089-5
  • Begley, C. Glenn and Lee M. Ellis, 2012, “Raise Standards for Preclinical Cancer Research: Drug Development”, Nature , 483(7391): 531–533. doi:10.1038/483531a
  • Bem, Daryl J., 2011, “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology , 100(3): 407–425.
  • Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth A. Bollen, et al., 2018, “Redefine Statistical Significance”, Nature Human Behaviour , 2(1): 6–10. doi:10.1038/s41562-017-0189-z
  • Braude, Stephen E., 1979, ESP and Psychokinesis. A Philosophical Examination , Philadelphia: Temple University Press.
  • Button, Katherine S., John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, 2013, “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience”, Nature Reviews Neuroscience , 14(5): 365–376. doi:10.1038/nrn3475
  • Camerer C.F., et al., 2018, “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015”, Nature Human Behaviour , 2: 637–644. doi: 10.1038/s41562-018-0399-z
  • Cartwright, Nancy, 1991, “Replicability, Reproducibility and Robustness: Comments on Harry Collins”, History of Political Economy , 23(1): 143–155.
  • Chambers, Christopher D., 2013, “Registered Reports: A New Publishing Initiative at Cortex”, Cortex , 49(3): 609–610. doi:10.1016/j.cortex.2012.12.016
  • –––, 2017, The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice , Princeton: Princeton University Press.
  • Chang, Andrew C. and Phillip Li, 2015, “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’”, Finance and Economics Discussion Series , 2015(83): 1–26. doi:10.17016/FEDS.2015.083
  • Churchman, C. West, 1948, “Statistics, Pragmatics, Induction”, Philosophy of Science , 15(3): 249–268. doi:10.1086/286991
  • Collins, Harry M., 1985, Changing Order: Replication and Induction in Scientific Practice , London; Beverly Hills: Sage Publications.
  • –––, 2016, “Reproducibility of experiments: experiments’ regress, statistical uncertainty principle, and the replication imperative” in Atmanspacher and Maasen 2016b: 65–82. doi:10.1002/9781118865064.ch4
  • Cohen, Jacob, 1962, “The Statistical Power of Abnormal-Social Psychological Research: A Review”, The Journal of Abnormal and Social Psychology , 65(3): 145–153. doi:10.1037/h0045186
  • –––, 1994, “The Earth Is Round (\(p < .05\))”, American Psychologist , 49(12): 997–1003, doi:10.1037/0003-066X.49.12.997
  • Cova, Florian, Brent Strickland, Angela Abatista, Aurélien Allard, James Andow, Mario Attie, James Beebe, et al., forthcoming, “Estimating the Reproducibility of Experimental Philosophy”, Review of Philosophy and Psychology , early online: 14 June 2018. doi:10.1007/s13164-018-0400-9
  • Cristea, Ioana Alina and John P. A. Ioannidis, 2018, “P Values in Display Items Are Ubiquitous and Almost Invariably Significant: A Survey of Top Science Journals”, Christos A. Ouzounis (ed.), PLoS ONE , 13(5): e0197440. doi:10.1371/journal.pone.0197440
  • Cumming, Geoff, 2012, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis . New York: Routledge.
  • Cumming, Geoff and Robert Calin-Jageman, 2017, Introduction to the New Statistics: Estimation, Open Science and Beyond , New York: Routledge.
  • Cumming, Geoff, Fiona Fidler, Martine Leonard, Pavel Kalinowski, Ashton Christiansen, Anita Kleinig, Jessica Lo, Natalie McMenamin, and Sarah Wilson, 2007, “Statistical Reform in Psychology: Is Anything Changing?”, Psychological Science , 18(3): 230–232. doi:10.1111/j.1467-9280.2007.01881.x
  • Di Bucchianico, Marilena, 2014, “A Matter of Phronesis: Experiment and Virtue in Physics, A Case Study”, in Virtue Epistemology Naturalized , Abrol Fairweather (ed.), Cham: Springer International Publishing, 291–312. doi:10.1007/978-3-319-04672-3_17
  • Dominus, Susan, 2017, “When the Revolution Came for Amy Cuddy”, The New York Times , October 21, Sunday Magazine, page 29.
  • Douglas, Heather, 2016, “Values in Science”, in Paul Humphreys, The Oxford Handbook of Philosophy of Science , New York: Oxford University Press, pp. 609–630.
  • Earp, Brian D. and David Trafimow, 2015, “Replication, Falsification, and the Crisis of Confidence in Social Psychology”, Frontiers in Psychology , 6(May): 621. doi:10.3389/fpsyg.2015.00621
  • Errington, Timothy M., Elizabeth Iorns, William Gunn, Fraser Elisabeth Tan, Joelle Lomax, and Brian A. Nosek, 2014, “An Open Investigation of the Reproducibility of Cancer Biology Research”, ELife , 3(December): e04333. doi:10.7554/eLife.04333
  • Etz, Alexander and Joachim Vandekerckhove, 2016, “A Bayesian Perspective on the Reproducibility Project: Psychology”, Daniele Marinazzo (ed.), PLoS ONE , 11(2): e0149794. doi:10.1371/journal.pone.0149794
  • Fanelli, Daniele, 2010a, “Do Pressures to Publish Increase Scientists’ Bias? An Empirical Support from US States Data”, Enrico Scalas (ed.), PLoS ONE , 5(4): e10271. doi:10.1371/journal.pone.0010271
  • –––, 2010b, “‘Positive’ Results Increase Down the Hierarchy of the Sciences”, Enrico Scalas (ed.), PLoS ONE , 5(4): e10068. doi:10.1371/journal.pone.0010068
  • –––, 2012, “Negative Results Are Disappearing from Most Disciplines and Countries”, Scientometrics , 90(3): 891–904. doi:10.1007/s11192-011-0494-7
  • Fang, Ferric C., R. Grant Steen, and Arturo Casadevall, 2012, “Misconduct Accounts for the Majority of Retracted Scientific Publications”, Proceedings of the National Academy of Sciences , 109(42): 17028–17033. doi:10.1073/pnas.1212247109
  • Feest, Uljana, 2016, “The Experimenters’ Regress Reconsidered: Replication, Tacit Knowledge, and the Dynamics of Knowledge Generation”, Studies in History and Philosophy of Science Part A , 58(August): 34–45. doi:10.1016/j.shpsa.2016.04.003
  • Fidler, Fiona, Mark A. Burgman, Geoff Cumming, Robert Buttrose, and Neil Thomason, 2006, “Impact of Criticism of Null-Hypothesis Significance Testing on Statistical Reporting Practices in Conservation Biology”, Conservation Biology , 20(5): 1539–1544. doi:10.1111/j.1523-1739.2006.00525.x
  • Fidler, Fiona, Yung En Chee, Bonnie C. Wintle, Mark A. Burgman, Michael A. McCarthy, and Ascelin Gordon, 2017, “Metaresearch for Evaluating Reproducibility in Ecology and Evolution”, BioScience , 67(3): 282–289. doi:10.1093/biosci/biw159
  • Fiedler, Klaus and Norbert Schwarz, 2016, “Questionable Research Practices Revisited”, Social Psychological and Personality Science , 7(1): 45–52. doi:10.1177/1948550615612150
  • Fiske, Susan T., 2016, “A Call to Change Science’s Culture of Shaming”, Association for Psychological Science Observer , 29(9). [ Fiske 2016 available online ]
  • Franklin, Allan, 1989, “The Epistemology of Experiment”, in David Gooding, Trevor Pinch, and Simon Schaffer (eds.), The Uses of Experiment: Studies in the Natural Sciences , Cambridge: Cambridge University Press, pp. 437–460.
  • –––, 1994, “How to Avoid the Experimenters’ Regress”, Studies in History and Philosophy of Science Part A , 25(3): 463–491. doi:10.1016/0039-3681(94)90062-0
  • Franklin, Allan and Harry Collins, 2016, “Two Kinds of Case Study and a New Agreement”, in The Philosophy of Historical Case Studies , Tilman Sauer and Raphael Scholl (eds.), Cham: Springer International Publishing, 319: 95–121. doi:10.1007/978-3-319-30229-4_6
  • Fraser, Hannah, Tim Parker, Shinichi Nakagawa, Ashley Barnett, and Fiona Fidler, 2018, “Questionable Research Practices in Ecology and Evolution”, Jelte M. Wicherts (ed.), PLoS ONE , 13(7): e0200303. doi:10.1371/journal.pone.0200303
  • Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe, 2015, “The Economics of Reproducibility in Preclinical Research”, PLoS Biology , 13(6): e1002165. doi:10.1371/journal.pbio.1002165
  • Giner-Sorolla, Roger, 2012, “Science or Art? How Aesthetic Standards Grease the Way Through the Publication Bottleneck but Undermine Science”, Perspectives on Psychological Science , 7(6): 562–571. doi:10.1177/1745691612457576
  • Gigerenzer, Gerd, 2018, “Statistical Rituals: The Replication Delusion and How We Got There”, Advances in Methods and Practices in Psychological Science , 1(2): 198–218. doi:10.1177/2515245918771329
  • Gilbert, Daniel T., Gary King, Stephen Pettigrew, and Timothy D. Wilson, 2016, “Comment on ‘Estimating the Reproducibility of Psychological Science’”, Science , 351(6277): 1037–1037. doi:10.1126/science.aad7243
  • Goldman, Alvin I., 1999, Knowledge in a Social World , Oxford: Clarendon. doi:10.1093/0198238207.001.0001
  • Gómez, Omar S., Natalia Juristo, and Sira Vegas, 2010, “Replications Types in Experimental Disciplines”, in Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement - ESEM ’10 , Bolzano-Bozen, Italy: ACM Press. doi:10.1145/1852786.1852790
  • Hackett, B., 2005, “Essential tensions: Identity, control, and risk in research”, Social Studies of Science , 35(5): 787–826. doi:10.1177/0306312705056045
  • Haller, Heiko, and Stefan Krauss, 2002, “Misinterpretations of Significance: a Problem Students Share with Their Teachers?” Methods of Psychological Research—Online , 7(1): 1–20. [ Haller & Krauss 2002 available online ]
  • Hartgerink, Chris H.J., Robbie C.M. van Aert, Michèle B. Nuijten, Jelte M. Wicherts, and Marcel A.L.M. van Assen, 2016, “Distributions of p -Values Smaller than .05 in Psychology: What Is Going On?”, PeerJ , 4(April): e1935. doi:10.7717/peerj.1935
  • Hendrick, Clyde, 1991, “Replication, Strict Replications, and Conceptual Replications: Are They Important?”, in Neuliep 1991: 41–49.
  • Ioannidis, John P. A., 2005, “Why Most Published Research Findings Are False”, PLoS Medicine , 2(8): e124. doi:10.1371/journal.pmed.0020124
  • Ioannidis, John P. A., Daniele Fanelli, Debbie Drake Dunne, and Steven N. Goodman, 2015, “Meta-Research: Evaluation and Improvement of Research Methods and Practices”, PLOS Biology , 13(10): e1002264. doi:10.1371/journal.pbio.1002264
  • Jennions, Michael D. and Anders Pape Møller, 2003, “A Survey of the Statistical Power of Research in Behavioral Ecology and Animal Behavior”, Behavioral Ecology , 14(3): 438–445. doi:10.1093/beheco/14.3.438
  • John, Leslie K., George Loewenstein, and Drazen Prelec, 2012, “Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling”, Psychological Science , 23(5): 524–532. doi:10.1177/0956797611430953
  • Kaiser, Jocelyn, 2018, “Plan to Replicate 50 High-Impact Cancer Papers Shrinks to Just 18”, Science , 31 July 2018. doi:10.1126/science.aau9619
  • Keppel, Geoffrey, 1982, Design and Analysis. A Researcher’s Handbook , second edition, Englewood Cliffs, NJ: Prentice-Hall.
  • Kerr, Norbert L., 1998, “HARKing: Hypothesizing After the Results Are Known”, Personality and Social Psychology Review , 2(3): 196–217. doi:10.1207/s15327957pspr0203_4
  • Kidwell, Mallory C., Ljiljana B. Lazarević, Erica Baranski, Tom E. Hardwicke, Sarah Piechowski, Lina-Sophia Falkenberg, Curtis Kennett, et al., 2016, “Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency”, Malcolm R Macleod (ed.), PLOS Biology , 14(5): e1002456. doi:10.1371/journal.pbio.1002456
  • Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Štěpán Bahník, Michael J. Bernstein, Konrad Bocian, et al., 2014, “Investigating Variation in Replicability: A ‘Many Labs’ Replication Project”, Social Psychology , 45(3): 142–152. doi:10.1027/1864-9335/a000178
  • Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A. J. Apps, Shlomo E. Argamon, Thom Baguley, et al., 2018, “Justify Your Alpha”, Nature Human Behaviour , 2(3): 168–171. doi:10.1038/s41562-018-0311-x
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton: Princeton University Press.
  • –––, 1996, “Cognitive and Non-Cognitive Values in Science: Rethinking the Dichotomy”, in Feminism, Science, and the Philosophy of Science , Lynn Hankinson Nelson and Jack Nelson (eds.), Dordrecht: Springer Netherlands, 39–58. doi:10.1007/978-94-009-1742-2_3
  • –––, 1997, “Feminist Epistemology as a Local Epistemology: Helen E. Longino”, Aristotelian Society Supplementary Volume , 71(1): 19–35. doi:10.1111/1467-8349.00017
  • Lykken, David T., 1968, “Statistical Significance in Psychological Research”, Psychological Bulletin , 70(3, Pt.1): 151–159. doi:10.1037/h0026141
  • Madden, Charles S., Richard W. Easley, and Mark G. Dunn, 1995, “How Journal Editors View Replication Research”, Journal of Advertising , 24(December): 77–87. doi:10.1080/00913367.1995.10673490
  • Makel, Matthew C., Jonathan A. Plucker, and Boyd Hegarty, 2012, “Replications in Psychology Research: How Often Do They Really Occur?”, Perspectives on Psychological Science , 7(6): 537–542. doi:10.1177/1745691612460688
  • MacCoun, Robert J. and Saul Perlmutter, 2017, “Blind Analysis as a Correction for Confirmatory Bias in Physics and in Psychology”, in Psychological Science Under Scrutiny , Scott O. Lilienfeld and Irwin D. Waldman (eds.), Hoboken, NJ: John Wiley & Sons, pp. 295–322. doi:10.1002/9781119095910.ch15
  • Martin, B., 1992, “Scientific fraud and the power structure of science”, Prometheus , 10(1): 83–98. doi:10.1080/08109029208629515
  • Masicampo, E.J. and Daniel R. Lalande, 2012, “A Peculiar Prevalence of p Values Just below .05”, Quarterly Journal of Experimental Psychology , 65(11): 2271–2279. doi:10.1080/17470218.2012.711335
  • Mahoney, Michael J., 1985, “Open Exchange and Epistemic Progress”, American Psychologist , 40(1): 29–39. doi:10.1037/0003-066X.40.1.29
  • Meehl, Paul E., 1967, “Theory-Testing in Psychology and Physics: A Methodological Paradox”, Philosophy of Science , 34(2): 103–115. doi:10.1086/288135
  • –––, 1978, “Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology”, Journal of Consulting and Clinical Psychology , 46(4): 806–834. doi:10.1037/0022-006X.46.4.806
  • Merton, Robert K., 1942 [1973], “A Note on Science and Technology in a Democratic Order”, Journal of Legal and Political Sociology , 1(1–2): 115–126; reprinted as “The Normative Structure of Science”, in Robert K. Merton (ed.) The Sociology of Science: Theoretical and Empirical Investigations , Chicago, IL: University of Chicago Press.
  • Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis, 2017, “A Manifesto for Reproducible Science”, Nature Human Behaviour , 1(1): 0021. doi:10.1038/s41562-016-0021
  • Neuliep, James William (ed.), 1991, Replication Research in the Social Sciences , (Journal of social behavior and personality; 8: 6), Newbury Park, CA: Sage Publications.
  • Neuliep, James W. and Rick Crandall, 1990, “Editorial Bias Against Replication Research”, Journal of Social Behavior and Personality , 5(4): 85–90
  • Nosek, Brian A. and Daniël Lakens, 2014, “Registered Reports: A Method to Increase the Credibility of Published Results”, Social Psychology , 45(3): 137–141. doi:10.1027/1864-9335/a000192
  • Nosek, Brian A., Jeffrey R. Spies, and Matt Motyl, 2012, “Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability”, Perspectives on Psychological Science , 7(6): 615–631. doi:10.1177/1745691612459058
  • Nosek, B. A., G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, et al., 2015, “Promoting an Open Research Culture”, Science , 348(6242): 1422–1425. doi:10.1126/science.aab2374
  • Nosek, Brian A., Charles R. Ebersole, Alexander C. DeHaven, and David T. Mellor, 2018, “The Preregistration Revolution”, Proceedings of the National Academy of Sciences , 115(11): 2600–2606. doi:10.1073/pnas.1708274114
  • Nuijten, Michèle B., Chris H. J. Hartgerink, Marcel A. L. M. van Assen, Sacha Epskamp, and Jelte M. Wicherts, 2016, “The Prevalence of Statistical Reporting Errors in Psychology (1985–2013)”, Behavior Research Methods , 48(4): 1205–1226. doi:10.3758/s13428-015-0664-2
  • Oakes, Michael, 1986, Statistical Inference: A Commentary for the Social and Behavioral Sciences , New York: Wiley.
  • Open Science Collaboration (OSC), 2015, “Estimating the Reproducibility of Psychological Science”, Science , 349(6251): 943–951. doi:10.1126/science.aac4716
  • Oransky, Ivan, 2016, “Half of Biomedical Studies Don’t Stand up to Scrutiny and What We Need to Do about That”, The Conversation , 11 November 2016. [ Oransky 2016 available online ]
  • Parker, T.H., E. Main, S. Nakagawa, J. Gurevitch, F. Jarrad, and M. Burgman, 2016, “Promoting Transparency in Conservation Science: Editorial”, Conservation Biology , 30(6): 1149–1150. doi:10.1111/cobi.12760
  • Pashler, Harold and Eric-Jan Wagenmakers, 2012, “Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?”, Perspectives on Psychological Science , 7(6): 528–530. doi:10.1177/1745691612465253
  • Peng, Roger D., 2011, “Reproducible Research in Computational Science”, Science , 334(6060): 1226–1227. doi:10.1126/science.1213847
  • –––, 2015, “The Reproducibility Crisis in Science: A Statistical Counterattack”, Significance , 12(3): 30–32. doi:10.1111/j.1740-9713.2015.00827.x
  • Radder, Hans, 1996, In And About The World: Philosophical Studies Of Science And Technology , Albany, NY: State University of New York Press.
  • –––, 2003, “Technology and Theory in Experimental Science”, in Hans Radder (ed.), The Philosophy of Scientific Experimentation , Pittsburgh: University of Pittsburgh Press, pp. 152–173.
  • –––, 2006, The World Observed/The World Conceived , Pittsburgh, PA: University of Pittsburgh Press.
  • –––, 2009, “Science, Technology and the Science-Technology Relationship”, in Anthonie Meijers (ed.), Philosophy of Technology and Engineering Sciences , Amsterdam: Elsevier, pp. 65–91. doi:10.1016/B978-0-444-51667-1.50007-0
  • –––, 2012, The Material Realization of Science: From Habermas to Experimentation and Referential Realism , Boston: Springer. doi:10.1007/978-94-007-4107-2
  • Rauscher, Frances H., Gordon L. Shaw, and Catherine N. Ky, 1993, “Music and Spatial Task Performance”, Nature , 365(6447): 611–611. doi:10.1038/365611a0
  • Rauscher, Frances H., Gordon L. Shaw, and Katherine N. Ky, 1995, “Listening to Mozart Enhances Spatial-Temporal Reasoning: Towards a Neurophysiological Basis”, Neuroscience Letters , 185(1): 44–47. doi:10.1016/0304-3940(94)11221-4
  • Ritchie, Stuart J., Richard Wiseman, and Christopher C. French, 2012, “Failing the Future: Three Unsuccessful Attempts to Replicate Bem’s ‘Retroactive Facilitation of Recall’ Effect”, Sam Gilbert (ed.), PLoS ONE , 7(3): e33423. doi:10.1371/journal.pone.0033423
  • Rooney, Phyllis, 1992, “On Values in Science: Is the Epistemic/Non-Epistemic Distinction Useful?”, PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association , 1992(1): 13–22. doi:10.1086/psaprocbienmeetp.1992.1.192740
  • Rosenthal, Robert, 1979, “The File Drawer Problem and Tolerance for Null Results”, Psychological Bulletin , 86(3): 638–641. doi:10.1037/0033-2909.86.3.638
  • –––, 1991, “Replication in Behavioral Research”, in Neuliep 1991: 1–39.
  • Rosnow, Ralph L. and Robert Rosenthal, 1989, “Statistical Procedures and the Justification of Knowledge in Psychological Science”, American Psychologist , 44(10): 1276–1284. doi:10.1037/0003-066X.44.10.1276
  • Rowhani-Farid, Anisa, Michelle Allen, and Adrian G. Barnett, 2017, “What Incentives Increase Data Sharing in Health and Medical Research? A Systematic Review”, Research Integrity and Peer Review , 2: 4. doi:10.1186/s41073-017-0028-9
  • Rudner, Richard, 1953, “The Scientist Qua Scientist Makes Value Judgments”, Philosophy of Science , 20(1): 1–6. doi:10.1086/287231
  • Sargent, C.L., 1981, “The Repeatability Of Significance And The Significance Of Repeatability”, European Journal of Parapsychology , 3: 423–433.
  • Schekman, Randy, 2013, “How Journals like Nature, Cell and Science Are Damaging Science | Randy Schekman”, The Guardian , December 9, sec. Opinion, [ Schekman 2013 available online ]
  • Schmidt, Stefan, 2009, “Shall We Really Do It Again? The Powerful Concept of Replication Is Neglected in the Social Sciences”, Review of General Psychology , 13(2): 90–100. doi:10.1037/a0015108
  • Silberzahn, Raphael and Eric L. Uhlmann, 2015, “Many Hands Make Tight Work: Crowdsourcing Research Can Balance Discussions, Validate Findings and Better Inform Policy”, Nature , 526(7572): 189–192.
  • Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn, 2011, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science , 22(11): 1359–1366. doi:10.1177/0956797611417632
  • Smith, Daniel R., Ian C.W. Hardy, and Martin P. Gammell, 2011, “Power Rangers: No Improvement in the Statistical Power of Analyses Published in Animal Behaviour”, Animal Behaviour , 81(1): 347–352. doi:10.1016/j.anbehav.2010.09.026
  • Sovacool, B. K., 2008, “Exploring scientific misconduct: Isolated individuals, impure institutions, or an inevitable idiom of modern science?” Journal of Bioethical Inquiry , 5: 271–282. doi: 10.1007/s11673-008-9113-6
  • Steel, Daniel, 2010, “Epistemic Values and the Argument from Inductive Risk*”, Philosophy of Science , 77(1): 14–34. doi:10.1086/650206
  • Stegenga, Jacob, 2018, Medical Nihilism , Oxford: Oxford University Press.
  • Steinle, Friedrich, 2016, “Stability and Replication of Experimental Results: A Historical Perspective”, in Atmanspacher and Maasen 2016b: 39–68. doi:10.1002/9781118865064.ch3
  • Sterling, Theodore D., 1959, “Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance – or Vice Versa”, Journal of the American Statistical Association , 54(285): 30–34. doi:10.1080/01621459.1959.10501497
  • Sutton, Jon, 2018, “Tone Deaf?”, The Psychologist , 31: 12–13. [ Sutton 2018 available online ]
  • Szucs, Denes and John P. A. Ioannidis, 2017, “Empirical Assessment of Published Effect Sizes and Power in the Recent Cognitive Neuroscience and Psychology Literature”, Eric-Jan Wagenmakers (ed.), PLoS Biology , 15(3): e2000797. doi:10.1371/journal.pbio.2000797
  • Teira, David, 2013, “A Contractarian Solution to the Experimenter’s Regress”, Philosophy of Science , 80(5): 709–720. doi:10.1086/673717
  • Vazire, Simine, 2018, “Implications of the Credibility Revolution for Productivity, Creativity, and Progress”, Perspectives on Psychological Science , 13(4): 411–417. doi:10.1177/1745691617751884
  • Wagenmakers, Eric-Jan, Ruud Wetzels, Denny Borsboom, Han L. J. van der Maas, and Rogier A. Kievit, 2012, “An Agenda for Purely Confirmatory Research”, Perspectives on Psychological Science , 7(6): 632–638. doi:10.1177/1745691612463078
  • Washburn, Anthony N., Brittany E. Hanson, Matt Motyl, Linda J. Skitka, Caitlyn Yantis, Kendal M. Wong, Jiaqing Sun, et al., 2018, “Why Do Some Psychology Researchers Resist Adopting Proposed Reforms to Research Practices? A Description of Researchers’ Rationales”, Advances in Methods and Practices in Psychological Science , 1(2): 166–173. doi:10.1177/2515245918757427
  • Wasserstein, Ronald L. and Nicole A. Lazar, 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”, The American Statistician , 70(2): 129–133. doi:10.1080/00031305.2016.1154108
Academic Tools

How to cite this entry. Preview the PDF version of this entry at the Friends of the SEP Society. Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers, with links to its database.
Other Internet Resources

  • Barba, Lorena A., 2017, “Science Reproducibility Taxonomy”, presentation slides for the 2017 Workshop on Reproducibility Taxonomies for Computing and Computational Science.
  • Kelly, Clint, 2017, “Redux: Do Behavioral Ecologists Replicate Their Studies?”, presented at Ignite Session 12, Ecological Society of America, Portland, Oregon, 8 August. [ Kelly 2017 abstract available online ]
  • McShane, Blakeley B., David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett, 2018, “ Abandon Statistical Significance ”, arXiv.org, first version 22 September 2017; latest revision, 8 September 2018.
  • Schnall, Simone, 2014, “ Social Media and the Crowd-Sourcing of Social Psychology ”, Blog Department of Psychology, Cambridge University , November 18.
  • Tilburg University Meta-Research Center
  • Meta-Research Innovation Center at Stanford (METRICS)
  • The saga of the summer 2017, a.k.a. ‘the alpha wars’, Barely Significant blog by Ladislas Nalborczyk.
  • 2017 American Statistical Association Symposium on Statistical Inference: Scientific Method for the 21st Century: A World Beyond \(p < 0.05\)
  • Improving Your Statistical Inferences, Daniël Lakens, 2018, Coursera.
  • StudySwap: A Platform for Interlab Replication, Collaboration, and Research Resource Exchange , Open Science Framework
  • Collaborative Replications and Education Project (CREP) , Open Science Framework
  • Registered Reports: Peer review before results are known to align scientific values and practices , Center for Open Science

Related Entries

Bayes’ Theorem | epistemology: Bayesian | measurement: in science | operationalism | science: theory and observation in | scientific knowledge: social dimensions of | scientific method | scientific research and big data

Copyright © 2018 by Fiona Fidler <fidlerfm@unimelb.edu.au> and John Wilcox <wilcoxje@stanford.edu>



National Academies Press: OpenBook

Reproducibility and Replicability in Science (2019)

Chapter: 3 understanding reproducibility and replicability, 3 understanding reproducibility and replicability, the evolving practices of science.

Scientific research has evolved from an activity mainly undertaken by individuals operating in a few locations to many teams, large communities, and complex organizations involving hundreds to thousands of individuals worldwide. In the 17th century, scientists would communicate through letters and were able to understand and assimilate major developments across all the emerging major disciplines. In 2016—the most recent year for which data are available—more than 2,295,000 scientific and engineering research articles were published worldwide ( National Science Foundation, 2018e ).

In addition, the number of scientific and engineering fields and subfields of research is large and has greatly expanded in recent years, especially in fields that intersect disciplines (e.g., biophysics); more than 230 distinct fields and subfields can now be identified. The published literature is so voluminous and specialized that some researchers look to information retrieval, machine learning, and artificial intelligence techniques to track and apprehend the important work in their own fields.

Another major revolution in science came with the recent explosion of the availability of large amounts of data in combination with widely available and affordable computing resources. These changes have transformed many disciplines, enabled important scientific discoveries, and led to major shifts in science. In addition, the use of statistical analysis of data has expanded, and many disciplines have come to rely on complex and expensive instrumentation that generates and can automate analysis of large digital datasets.

Large-scale computation has been adopted in fields as diverse as astronomy, genetics, geoscience, particle physics, and social science, and has added scope to fields such as artificial intelligence. The democratization of data and computation has created new ways to conduct research; in particular, large-scale computation allows researchers to do research that was not possible a few decades ago. For example, public health researchers mine large databases and social media, searching for patterns, while earth scientists run massive simulations of complex systems to learn about the past, which can offer insight into possible future events.

Another change in science is an increased pressure to publish new scientific discoveries in prestigious and what some consider high-impact journals, such as Nature and Science. 1 This pressure is felt worldwide, across disciplines, and by researchers at all levels but is perhaps most acute for researchers at the beginning of their scientific careers who are trying to establish a strong scientific record to increase their chances of obtaining tenure at an academic institution and grants for future work. Tenure decisions have traditionally been made on the basis of the scientific record (i.e., published articles of important new results in a field) and have given added weight to publications in more prestigious journals. Competition for federal grants, a large source of academic research funding, is intense as the number of applicants grows at a rate higher than the increase in federal research budgets. These multiple factors create incentives for researchers to overstate the importance of their results and increase the risk of bias—either conscious or unconscious—in data collection, analysis, and reporting.

1 “High-impact” journals are viewed by some as those which possess high scores according to one of the several journal impact indicators, such as CiteScore, Scimago Journal Ranking (SJR), and Source Normalized Impact per Paper (SNIP), which are available in Scopus, and Journal Impact Factor (IF), Eigenfactor (EF), and Article Influence Score (AIS), which can be obtained from the Journal Citation Reports (JCR).

In the context of these dynamic changes, the questions and issues related to reproducibility and replicability remain central to the development and evolution of science. How should studies and other research approaches be designed to efficiently generate reliable knowledge? How might hypotheses and results be better communicated to allow others to confirm, refute, or build on them? How can the potential biases of scientists themselves be understood, identified, and exposed in order to improve accuracy in the generation and interpretation of research results? How can intentional misrepresentation and fraud be detected and eliminated? 2

Researchers have proposed approaches to answering some of these questions over the past decades. As early as the 1960s, Jacob Cohen surveyed psychology articles from the perspective of statistical power to detect effect sizes, an approach that launched many power surveys (also known as meta-analyses) in the social sciences in subsequent years ( Cohen, 1988 ).

Researchers in biomedicine have been focused on threats to validity of results since at least the 1970s. In response to the threat, biomedical researchers developed a wide variety of approaches to address the concern, including an emphasis on randomized experiments with masking (also known as blinding), reliance on meta-analytic summaries over individual trial results, proper sizing and power of experiments, and the introduction of trial registration and detailed experimental protocols. Many of the same approaches have been proposed to counter shortcomings in reproducibility and replicability.

Reproducibility and replicability as they relate to data and computation-intensive scientific work received attention as the use of computational tools expanded. In the 1990s, Jon Claerbout launched the “reproducible research movement,” brought on by the growing use of computational workflows for analyzing data across a range of disciplines ( Claerbout and Karrenbach, 1992 ). Minor mistakes in code can lead to serious errors in interpretation and in reported results; Claerbout’s proposed solution was to establish an expectation that data and code will be openly shared so that results could be reproduced. The assumption was that reanalysis of the same data using the same methods would produce the same results.

In the 2000s and 2010s, several high-profile journal and general media publications focused on concerns about reproducibility and replicability (see, e.g., Ioannidis, 2005; Baker, 2016), including the cover story in The Economist (“How Science Goes Wrong,” 2013) noted above. These articles introduced new concerns about the availability of data and code and highlighted problems of publication bias, selective reporting, and misaligned incentives that cause positive results to be favored for publication over negative or nonconfirmatory results. 3 Some news articles focused on issues in biomedical research and clinical trials, which were discussed in the general media partly as a result of lawsuits and settlements over widely used drugs (Fugh-Berman, 2010).

2 See Chapter 5, Fraud and Misconduct, which further discusses the association between misconduct as a source of non-replicability, its frequency, and reporting by the media.

Many publications about reproducibility and replicability have focused on the lack of data, code, and detailed description of methods in individual studies or a set of studies. Several attempts have been made to assess non-reproducibility or non-replicability within a field, particularly in social sciences (e.g., Camerer et al., 2018 ; Open Science Collaboration, 2015 ). In Chapters 4 , 5 , and 6 , we review in more detail the studies, analyses, efforts to improve, and factors that affect the lack of reproducibility and replicability. Before that discussion, we must clearly define these terms.

DEFINING REPRODUCIBILITY AND REPLICABILITY

Different scientific disciplines and institutions use the words reproducibility and replicability in inconsistent or even contradictory ways: What one group means by one word, the other group means by the other word. 4 These terms—and others, such as repeatability—have long been used in relation to the general concept of one experiment or study confirming the results of another. Within this general concept, however, no terminologically consistent way of drawing distinctions has emerged; instead, conflicting and inconsistent terms have flourished. The difficulties in assessing reproducibility and replicability are complicated by this absence of standard definitions for these terms.

In some fields, one term has been used to cover all related concepts: for example, “replication” historically covered all concerns in political science (King, 1995). In many settings, the terms reproducible and replicable have distinct meanings, but different communities adopted opposing definitions (Claerbout and Karrenbach, 1992; Peng et al., 2006; Association for Computing Machinery, 2018). Some have added qualifying terms, such as methods reproducibility, results reproducibility, and inferential reproducibility, to the lexicon (Goodman et al., 2016). In particular, tension has emerged between the usage recently adopted in computer science and the way that researchers in other scientific disciplines have described these ideas for years (Heroux et al., 2018).

3 One such outcome became known as the “file drawer problem”: see Chapter 5; also see Rosenthal (1979).

4 For the negative case, both “non-reproducible” and “irreproducible” are used in scientific work and are synonymous.

In the early 1990s, investigators began using the term “reproducible research” for studies that provided a complete digital compendium of data and code to reproduce their analyses, particularly in the processing of seismic wave recordings ( Claerbout and Karrenbach, 1992 ; Buckheit and Donoho, 1995 ). The emphasis was on ensuring that a computational analysis was transparent and documented so that it could be verified by other researchers. While this notion of reproducibility is quite different from situations in which a researcher gathers new data in the hopes of independently verifying previous results or a scientific inference, some scientific fields use the term reproducibility to refer to this practice. Peng et al. (2006 , p. 783) referred to this scenario as “replicability,” noting: “Scientific evidence is strengthened when important results are replicated by multiple independent investigators using independent data, analytical methods, laboratories, and instruments.” Despite efforts to coalesce around the use of these terms, lack of consensus persists across disciplines. The resulting confusion is an obstacle in moving forward to improve reproducibility and replicability ( Barba, 2018 ).

In a review paper on the use of the terms reproducibility and replicability, Barba (2018) outlined three categories of usage, which she characterized as A, B1, and B2. In category A, the terms are used interchangeably, with no distinction drawn between them. In B1, reproducibility refers to reusing the original authors’ digital artifacts of research (the “research compendium”), while replicability refers to independently created digital artifacts and newly collected data. In B2, the assignments are reversed. B1 and B2 are thus in opposition to each other with respect to which term involves reusing the original authors’ digital artifacts and which involves independently created digital artifacts. Barba (2018) collected data on the usage of these terms across a variety of disciplines (see Table 3-1 ). 5

5 See also Heroux et al. (2018) for a discussion of the competing taxonomies between computational sciences (B1) and new definitions adopted in computer science (B2) and proposals for resolving the differences.

TABLE 3-1 Usage of the Terms Reproducibility and Replicability by Scientific Discipline

A: Political Science; Economics

B1: Signal Processing; Scientific Computing; Econometry; Epidemiology; Clinical Studies; Internal Medicine; Physiology (neurophysiology); Computational Biology; Biomedical Research; Statistics

B2: Microbiology, Immunology (FASEB); Computer Science (ACM)

NOTES: See text for discussion. ACM = Association for Computing Machinery; FASEB = Federation of American Societies for Experimental Biology. SOURCE: Barba (2018, Table 2).

The terminology adopted by the Association for Computing Machinery (ACM) for computer science was published in 2016 as a system for badges attached to articles published by the society. The ACM declared that its definitions were inspired by the metrology vocabulary, associating the use of an original author’s digital artifacts with “replicability” and the development of completely new digital artifacts with “reproducibility.” These terminological distinctions contradict the usage in computational science, where reproducibility is associated with transparency and access to the author’s digital artifacts, as well as the usage in the social sciences, economics, clinical studies, and other domains, where replication studies collect new data to verify the original findings.

Regardless of the specific terms used, the underlying concepts have long played essential roles in all scientific disciplines. These concepts are closely connected to the following general questions about scientific results:

  • Are the data and analysis laid out with sufficient transparency and clarity that the results can be checked?
  • If checked, do the data and analysis offered in support of the result in fact support that result?
  • If the data and analysis are shown to support the original result, can the result reported be found again in the specific study context investigated?
  • Finally, can the result reported or the inference drawn be found again in a broader set of study contexts?

Computational scientists generally use the term reproducibility to answer just the first question—that is, reproducible research is research that is capable of being checked because the data, code, and methods of analysis are available to other researchers. The term reproducibility can also be used in the context of the second question: research is reproducible if another researcher actually uses the available data and code and obtains the same results. The difference between the first and the second questions is one of action by another researcher; the first refers to the availability of the data, code, and methods of analysis, while the second refers to the act of recomputing the results using the available data, code, and methods of analysis.

In order to answer the first and second questions, a second researcher uses data and code from the first; no new data or code are created by the second researcher. Reproducibility depends only on whether the methods of the computational analysis were transparently and accurately reported and whether that data, code, or other materials were used to reproduce the original results. In contrast, to answer question three, a researcher must redo the study, following the original methods as closely as possible and collecting new data. To answer question four, a researcher could take a variety of paths: choose a new condition of analysis, conduct the same study in a new context, or conduct a new study aimed at the same or similar research question.
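A minimal sketch of what answering the second question looks like in practice is given below (our own illustration, not a procedure from the report): a second researcher reruns the shared analysis on the shared data and checks the recomputed value against the reported one. The dataset, analysis function, and reported value are all hypothetical.

```python
# Minimal sketch (our illustration): a second researcher reruns the authors'
# shared analysis on their shared data and checks the recomputed value against
# the reported one. Data, analysis, and reported value are all hypothetical.
import statistics

shared_data = [2.9, 3.4, 3.1, 3.6, 3.2, 3.0]   # stand-in for the authors' shared dataset

def shared_analysis(data):
    """Stand-in for the authors' shared code: report the sample mean."""
    return statistics.mean(data)

REPORTED_RESULT = 3.2                           # value as published in the hypothetical article
recomputed = shared_analysis(shared_data)

# Agreement within numerical tolerance answers the second question affirmatively.
print("computationally reproduced:", abs(recomputed - REPORTED_RESULT) < 1e-9)
```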

For the purposes of this report and with the aim of defining these terms in ways that apply across multiple scientific disciplines, the committee has chosen to draw the distinction between reproducibility and replicability between the second and third questions. Thus, reproducibility includes the act of a second researcher recomputing the original results, and it can be satisfied with the availability of data, code, and methods that makes that recomputation possible. This definition of reproducibility refers to the transparency and reproducibility of computations: that is, it is synonymous with “computational reproducibility,” and we use the terms interchangeably in this report.

When a new study is conducted and new data are collected, aimed at the same or a similar scientific question as a previous one, we define it as a replication. A replication attempt might be conducted by the same investigators in the same lab in order to verify the original result, or it might be conducted by new investigators in a new lab or context, using the same or different methods and conditions of analysis. If this second study, aimed at the same scientific question but collecting new data, finds consistent results or can draw consistent conclusions, the research is replicable. If a second study explores a similar scientific question but in other contexts or populations that differ from the original one and finds consistent results, the research is “generalizable.” 6

6 The committee definitions of reproducibility, replicability, and generalizability are consistent with the National Science Foundation’s Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science (Bollen et al., 2015).

In summary, after extensive review of the ways these terms are used by different scientific communities, the committee adopted specific definitions for this report.

CONCLUSION 3-1: For this report, reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility,” and the terms are used interchangeably in this report.

Replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Two studies may be considered to have replicated if they obtain consistent results given the level of uncertainty inherent in the system under study. In studies that measure a physical entity (i.e., a measurand), the results may be the sets of measurements of the same measurand obtained by different laboratories. In studies aimed at detecting an effect of an intentional intervention or a natural event, the results may be the type and size of effects found in different studies aimed at answering the same question. In general, whenever new data are obtained that constitute the results of a study aimed at answering the same scientific question as another study, the degree of consistency of the results from the two studies constitutes their degree of replication.

Two important constraints on the replicability of scientific results rest in limits to the precision of measurement and the potential for altered results due to sometimes subtle variation in the methods and steps performed in a scientific study. We expressly consider both here, as they can each have a profound influence on the replicability of scientific studies.

PRECISION OF MEASUREMENT

Virtually all scientific observations involve counts, measurements, or both. Scientific measurements may be of many different kinds: spatial dimensions (e.g., size, distance, and location), time, temperature, brightness, colorimetric properties, electromagnetic properties, electric current, material properties, acidity, and concentration, to name a few from the natural sciences. The social sciences are similarly replete with counts and measures. With each measurement comes a characterization of the margin of doubt, or an assessment of uncertainty (Possolo and Iyer, 2017). Indeed, it may be said that measurement, quantification, and uncertainties are core features of scientific studies.

One mark of progress in science and engineering has been the ability to make increasingly exact measurements on a widening array of objects and phenomena. Many of the things taken for granted in the modern world, from mechanical engines to interchangeable parts to smartphones, are possible only because of advances in the precision of measurement over time ( Winchester, 2018 ).

The concept of precision refers to the degree of closeness in measurements. As the unit used to measure distance, for example, shrinks from meter to centimeter to millimeter and so on down to micron, nanometer, and angstrom, the measurement unit becomes more exact and the proximity of one measurand to a second can be determined more precisely.

Even when scientists believe a quantity of interest is constant, they recognize that repeated measurement of that quantity may vary because of limits in the precision of measurement technology. It is useful to note that precision is different from the accuracy of a measurement system, as shown in Figure 3-1 , demonstrating the differences using an archery target containing three arrows.

In Figure 3-1, A, the three arrows are in the outer ring, not close together and not close to the bull’s eye, illustrating low accuracy and low precision (i.e., the shots have not been accurate and are not highly precise). In B, the arrows are clustered in a tight band in an outer ring, illustrating low accuracy and high precision (i.e., the shots have been more precise, but not accurate). The other two figures similarly illustrate high accuracy and low precision (C) and high accuracy and high precision (D).

It is critical to keep in mind that the accuracy of a measurement can be judged only in relation to a known standard of truth. If the exact location of the bull’s eye is unknown, one must not presume that a more precise set of measures is necessarily more accurate; the results may simply be subject to a more consistent bias, moving them in a consistent way in a particular direction and distance from the true target.
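The distinction can be made concrete with a small numerical sketch (our own illustration, using simulated measurements): precision is the spread of repeated measurements, while accuracy is their offset from a known reference value.

```python
# Minimal sketch (our illustration): precise-but-inaccurate vs accurate-but-imprecise
# measurements, judged against a known reference value.
import numpy as np

TRUE_VALUE = 10.0                                    # known standard of truth
rng = np.random.default_rng(2)

precise_biased = rng.normal(10.5, 0.05, size=100)    # tight spread, consistent +0.5 bias
accurate_noisy = rng.normal(10.0, 0.50, size=100)    # centred on the truth, wide spread

for name, x in [("precise but biased", precise_biased), ("accurate but noisy", accurate_noisy)]:
    print(f"{name}: bias = {x.mean() - TRUE_VALUE:+.3f}, spread (SD) = {x.std(ddof=1):.3f}")
```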

It is often useful in science to describe quantitatively the central tendency and degree of dispersion among a set of repeated measurements of the same entity and to compare one set of measurements with a second set. When a set of measurements is repeated by the same operator using the same equipment under constant conditions and close in time, metrologists refer to the proximity of these measurements to one another as measurement repeatability (see Box 3-1). When one is interested in comparing the degree to which the set of measurements obtained in one study are consistent with the set of measurements obtained in a second study, the committee characterizes this as a test of replicability because it entails the comparison of two studies aimed at the same scientific question where each obtained its own data.

Consider, for example, the set of measurements of the physical constant obtained over time by a number of laboratories (see Figure 3-2 ). For each laboratory’s results, the figure depicts the mean observation (i.e., the central tendency) and standard error of the mean, indicated by the error bars. The standard error is an indicator of the precision of the obtained measurements, where a smaller standard error represents higher precision. In comparing the measurements obtained by the different laboratories, notice that both the mean values and the degrees of precision (as indicated by the width of the error bars) may differ from one set of measurements to another.


We may now ask what is a central question for this study: How well does a second set of measurements (or results) replicate a first set of measurements (or results)? Answering this question, we suggest, may involve three components:

  • proximity of the mean value (central tendency) of the second set relative to the mean value of the first set, measured both in physical units and relative to the standard error of the estimate
  • similitude in the degree of dispersion in observed values about the mean in the second set relative to the first set
  • likelihood that the second set of values and the first set of values could have been drawn from the same underlying distribution

Depending on circumstances, one or another of these components could be more salient for a particular purpose. For example, two sets of measures could have means that are very close to one another in physical units, yet each were sufficiently precisely measured as to be very unlikely to be different by chance. A second comparison may find means are further apart, yet derived from more widely dispersed sets of observations, so that there is a higher likelihood that the difference in means could have been observed by chance. In terms of physical proximity, the first comparison is more closely replicated. In terms of the likelihood of being derived from the same underlying distribution, the second set is more highly replicated.

A simple visual inspection of the means and standard errors for measurements obtained by different laboratories may be sufficient for a judgment about their replicability. For example, in Figure 3-2 , it is evident that the bottom two measurement results have relatively tight precision and means that are nearly identical, so it seems reasonable these can be considered to have replicated one another. It is similarly evident that results from LAMPF (second from the top of reported measurements with a mean value and error bars in Figure 3-2 ) are better replicated by results from LNE-01 (fourth from top) than by measurements from NIST-89 (sixth from top). More subtle may be judging the degree of replication when, for example, one set of measurements has a relatively wide range of uncertainty compared to another. In Figure 3-2 , the uncertainty range from NPL-88 (third from top) is relatively wide and includes the mean of NIST-97 (seventh from top); however, the narrower uncertainty range for NIST-97 does not include the mean from NPL-88. Especially in such cases, it is valuable to have a systematic, quantitative indicator of the extent to which one set of measurements may be said to have replicated a second set of measurements, and a consistent means of quantifying the extent of replication can be useful in all cases.
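One way to make such judgments systematic is to compute simple quantitative indicators for each of the three components listed above. The sketch below is our own illustration, not a method prescribed by the report; the combined standard error of the difference and the two-sample Kolmogorov–Smirnov test for the distributional component are our own choices.

```python
# Minimal sketch (our illustration) of quantitative indicators for the three
# components above; the combined standard error and the two-sample
# Kolmogorov-Smirnov test are our own choices, not prescriptions of the report.
import numpy as np
from scipy import stats

def replication_summary(x1, x2):
    """Compare two sets of measurements aimed at the same measurand."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    diff = x2.mean() - x1.mean()                        # proximity of central tendencies
    se = np.hypot(stats.sem(x1), stats.sem(x2))         # combined standard error of the difference
    dispersion_ratio = x2.std(ddof=1) / x1.std(ddof=1)  # similitude of dispersion
    p_same = stats.ks_2samp(x1, x2).pvalue              # could both come from one distribution?
    return {"mean_diff": diff, "mean_diff_in_se": diff / se,
            "dispersion_ratio": dispersion_ratio, "p_same_distribution": p_same}

# Hypothetical example: two laboratories measuring the same constant.
rng = np.random.default_rng(0)
lab_a = rng.normal(6.67430, 0.00015, size=30)
lab_b = rng.normal(6.67435, 0.00050, size=30)
print(replication_summary(lab_a, lab_b))
```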

VARIATIONS IN METHODS EMPLOYED IN A STUDY

When closely scrutinized, a scientific study or experiment may be seen to entail hundreds or thousands of choices, many of which are barely conscious or taken for granted. In the laboratory, exactly what size of Erlenmeyer flask is used to mix a set of reagents? At what exact temperature were the reagents stored? Was a drying agent such as acetone used on the glassware? Which agent and in what amount and exact concentration? Within what tolerance of error are the ingredients measured? When ingredient A was combined with ingredient B, was the flask shaken or stirred? How vigorously and for how long? What manufacturer of porcelain filter was used? If conducting a field survey, how exactly were the subjects selected? Are the interviews conducted by computer or over the phone or in person? Are the interviews conducted by female or male, young or old, the same or different race as the interviewee? What is the exact wording of a question? If spoken, with what inflection? What is the exact sequence of questions? Without belaboring the point, we can say that many of the exact methods employed in a scientific study may or may not be described in the methods section of a publication. An investigator may or may not realize when a possible variation could be consequential to the replicability of results.

In a later section, we will deal more generally with sources of non-replicability in science (see Chapter 5 and Box 5-2 ). Here, we wish to emphasize that countless subtle variations in the methods, techniques, sequences, procedures, and tools employed in a study may contribute in unexpected ways to differences in the obtained results (see Box 3-2 ).

Finally, note that a single scientific study may entail elements of the several concepts introduced and defined in this chapter, including computational reproducibility, precision in measurement, replicability, and generalizability or any combination of these. For example, a large epidemiological survey of air pollution may entail portable, personal devices to measure various concentrations in the air (subject to precision of measurement), very large datasets to analyze (subject to computational reproducibility), and a large number of choices in research design, methods, and study population (subject to replicability and generalizability).

RIGOR AND TRANSPARENCY

The committee was asked to “make recommendations for improving rigor and transparency in scientific and engineering research” (refer to Box 1-1 in Chapter 1 ). In response to this part of our charge, we briefly discuss the meanings of rigor and of transparency below and relate them to our topic of reproducibility and replicability.

Rigor is defined as “the strict application of the scientific method to ensure robust and unbiased experimental design” ( National Institutes of Health, 2018e ). Rigor does not guarantee that a study will be replicated, but conducting a study with rigor—with a well-thought-out plan and strict adherence to methodological best practices—makes it more likely. One of the assumptions of the scientific process is that rigorously conducted studies “and accurate reporting of the results will enable the soundest decisions” and that a series of rigorous studies aimed at the same research question “will offer successively ever-better approximations to the truth” ( Wood et al., 2019 , p. 311). Practices that indicate a lack of rigor, including poor study design, errors or sloppiness, and poor analysis and reporting, contribute to avoidable sources of non-replicability (see Chapter 5 ). Rigor affects both reproducibility and replicability.

Transparency has a long tradition in science. Since the advent of scientific reports and technical conferences, scientists have shared details about their research, including study design, materials used, details of the system under study, operationalization of variables, measurement techniques, uncertainties in measurement in the system under study, and how data were collected and analyzed. A transparent scientific report makes clear whether the study was exploratory or confirmatory, shares information about what measurements were collected and how the data were prepared, which analyses were planned and which were not, and communicates the level of uncertainty in the result (e.g., through an error bar, sensitivity analysis, or p-value). Only by sharing all this information might it be possible for other researchers to confirm and check the correctness of the computations, attempt to replicate the study, and understand the full context of how to interpret the results. Transparency of data, code, and computational methods is directly linked to reproducibility, and it also applies to replicability. The clarity, accuracy, specificity, and completeness in the description of study methods directly affects replicability.

FINDING 3-1: In general, when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated.

One of the pathways by which the scientific community confirms the validity of a new scientific discovery is by repeating the research that produced it. When a scientific effort fails to independently confirm the computations or results of a previous study, some fear that it may be a symptom of a lack of rigor in science, while others argue that such an observed inconsistency can be an important precursor to new discovery.

Concerns about reproducibility and replicability have been expressed in both scientific and popular media. As these concerns came to light, Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct a study to assess the extent of issues related to reproducibility and replicability and to offer recommendations for improving rigor and transparency in scientific research.

Reproducibility and Replicability in Science defines reproducibility and replicability and examines the factors that may lead to non-reproducibility and non-replicability in research. Unlike the typical expectation of reproducibility between two computations, expectations about replicability are more nuanced, and in some cases a lack of replicability can aid the process of scientific discovery. This report provides recommendations to researchers, academic institutions, journals, and funders on steps they can take to improve reproducibility and replicability in science.


Reproducibility: why it matters and how it can be nurtured

October 26, 2021

By Torie Eva, Catriona Fennell, Katie Eve


At Elsevier, we support reproducibility of research through a wide range of initiatives

When you are seeking out a new recipe, it’s good to know that others have tried it and produced similar results to the original chef. You know it’s  reproducible , and therefore you can  trust  the recipe. The same goes for research – but the stakes are astronomically higher.

Reproducibility – the repeatability of research findings to enable research and knowledge to progress – is vital and underpins trust in science. However, many challenges exist in relation to reproducing research. At Elsevier, we take our role as a steward of trust seriously; in this article, we explore reproducibility challenges and Elsevier’s activities to address them.

Complex challenges in reproducibility

Reproducibility faces many challenges, and they vary considerably between research fields.

While most researchers absolutely support the need for reproducible research, attitudes towards the practices that facilitate it can vary as can the incentives for those practices. Take data sharing: some researchers may simply be cautious about sharing their data openly or concerned about data misinterpretation. Meanwhile, in certain disciplines like medicine, there are barriers to data sharing relating to privacy, obtaining consent from patients and research subjects, and issues with data de-identification. Similarly, with negative/null results, researchers may not want to be associated with such a study, and this is compounded by reward systems that traditionally prioritize high-impact results.

Some disciplines and studies present unique challenges. For example, psychological studies can be more difficult to reproduce due to natural variations in human behavior and low statistical power, while the use of AI in research brings specific challenges. Researchers need to share code, software and computing details to reproduce computational experiments, for instance, as well as provide a substantial level of information on data and its provenance, descriptions of methods, and detailed specifications of the hyper-parameters used to generate results.
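As one concrete illustration of such "computing details", the sketch below (our own example, not an Elsevier requirement) records the random seed, hyper-parameters, and software environment alongside a run so that a computational experiment can be re-executed; the file name and parameter values are hypothetical.

```python
# Minimal sketch (our illustration): record the seed, hyper-parameters, and
# software environment alongside a run so the computation can be re-executed.
# The file name and parameter values are hypothetical.
import json
import platform
import random
import sys

config = {
    "seed": 42,
    "hyperparameters": {"learning_rate": 1e-3, "batch_size": 32, "epochs": 10},
    "python": sys.version.split()[0],
    "platform": platform.platform(),
}

random.seed(config["seed"])                 # fix the seed before any sampling happens

with open("run_config.json", "w") as f:     # hypothetical provenance file saved with the results
    json.dump(config, f, indent=2)

print(json.dumps(config, indent=2))
```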

Also, the production of research itself is often subject to natural variation and honest human errors: experimental mistakes or wrong conclusions are an inherent risk when conducting research. For example, an unstable reagent or contaminated sample can yield differing results. We should avoid stigmatizing researchers if certain studies cannot be easily reproduced, given that it’s rare for this to be caused by a researcher falsifying their results with the intention to deceive. If we fail to do this, authors may become less willing to share their data and methods with others and less inclined to alert the journal if they are unable to repeat their own previous work.

Lastly, the questions researchers seek to address are increasingly complex, and the corresponding studies are potentially expensive to design and implement, which again has knock-on effects for reproducibility.

Reproducibility efforts at Elsevier

It is evident that responsibility for reproducibility does not land at the feet of a single stakeholder group. It requires collaboration among funders, institutions, publishers and researchers to provide and fund education and incentives, set standards and change behaviors on reproducibility of research across the research process, from idea to publication.

A prerequisite for reproducibility is full transparency in how studies and experiments were undertaken, the methods and protocols involved, and the results obtained. We draw here on our experience with new approaches Elsevier has trialed to nurture reproducibility and thereby maintain the integrity of the scholarly record.

Supporting transparent methodology

Our journals incentivize researchers to be transparent in their methodology, which in turn supports reproducibility, allowing others to follow a clear and reproducible method. In this vein, Cell Press launched STAR Methods, which defines the features of a robust, reproducible method: structured, transparent, accessible reporting. STAR Methods has proved to be highly successful, and key elements have already been rolled out to 1,500 journals.

We also offer entire journals dedicated to methods transparency. STAR Protocols publishes complete, authoritative and consistent instructions on how to conduct experiments. MethodsX publishes small but important customizations to methods. Such journals not only give authors an incentive to share this information but also help other researchers be more efficient.

Encouraging open research data sharing

Research data is the foundation on which scientific knowledge is built. Access to the research data that underpins published findings helps ensure the research can be successfully reproduced. As a key pillar of open science, and to uphold research integrity, we promote and encourage  open research data sharing practices and incentives for authors. These include:

  • Providing products and platforms to incentivize researchers to share their research data in a way that is structured but easy for researchers to deploy. Examples include Mendeley Data, Digital Commons and Entellect.
  • Promoting FAIR data principles to ensure that data is findable, accessible, interoperable and reusable. We are a founding member of Force11, which developed the FAIR principles.
  • Implementing journal policies that encourage researchers to share research data transparently or provide a data availability statement.
  • Integrating data sharing into our submission workflows, thereby using our journals to incentivize researchers to share data. For example, we encourage data sharing at the point a researcher submits their paper to our journals. We found that simply making it easy for authors to share data, and reminding them early in the publication process, doubled the amount of data sharing, supporting reproducibility.
  • Investing in journals that publish data outputs, such as SoftwareX and Data in Brief.

Peer review innovation

Initiatives such as  Registered Reports  and  Results Masked Review  aim for work to be judged on the merits of the research question and methodology, not the findings.

Registered Reports  requires authors to submit and commit to their protocols before experiments are conducted. The journal then accepts the paper in principle, based on whether editors believe the protocol has merit, and commits to publishing the research regardless of the results.

With  Results Masked Review , the experiments have already taken place, but the reviewers are first sent the paper with the results masked. Both of these models prevent publication bias and enhance transparency, thereby ensuring that results aren’t skewed in pursuit of publication.

Inclusion & diversity

Finally, inclusion and diversity in research is crucial for reproducibility. For example, if clinical trials are carried out only on men, there is a strong likelihood of divergence if they are repeated with women.

This has clear implications for society: products of research must take diversity into account. Elsevier has undertaken a range of  activities  to support inclusion and diversity in science, including developing a  Gender Equality resource center  providing free access to research, data, and tools related to gender; supporting gender balance across our editorial boards and the research community; and increasing awareness of these issues through our  gender reports . Furthermore, publishers’ management of the peer review process includes providing tools and information to help editors to find the most relevant reviewers and increase diversity in the peer reviewer pool.  As we recently explained , expertise and diversity in the peer review process reduces bias and increases scientific rigor, in turn enhancing reproducibility.

Looking ahead: a stakeholder collaboration

For many years, Elsevier has been experimenting with a range of solutions to meet reproducibility challenges. However, our pilots are subject to differing results and degrees of success. For instance, journals publishing negative results and replication studies have so far seen low take-up from researchers, while our data journals have been highly successful.

We will nevertheless continue our work to promote reproducibility. In terms of data sharing, for example, our ambition is that during the course of 2022, we will require authors to link to their datasets or provide data availability statements across the majority of our journals. We continue to test and learn from the results of our journal and article pilots, while promoting our ongoing innovative projects like  Registered Reports .

However, publishers are but one stakeholder, and reproducibility practices also need to be promoted by other parts of the research ecosystem. By their nature, journals operate far along in the research process and are therefore limited in their ability to encourage reproducibility at critical earlier junctures.

While some researchers are already championing reproducibility, the research community as a whole needs education, rewards and incentives to embark on practices that encourage reproducibility from the outset of their research projects. Stakeholders who operate at these earlier stages of research development, including funders and institutions, have important opportunities to influence and appropriately fund and incentivize researchers on reproducibility, complementary to the role of journals.

As outlined in the manifesto for reproducible science, stakeholders across the research community must work together to address reproducibility challenges, collaborating and aligning to build a positive research culture that rewards and integrates reproducibility practices. This could include agreeing on standards that support reproducibility, developing incentives – or even mandates – for authors to share data, and encouraging researchers to publish negative and null results.

We should examine fundamental questions, including:

  • What would be gained if research was fully reproducible?
  • What changes regarding research, publishing incentives and infrastructure would be required to make this possible, and which of these would have the greatest impact?
  • How can we resolve ongoing researcher concerns around practices such as data sharing?

We know we have more work to do to ensure the research we publish is as robust and reproducible as possible. We will continue this endeavor via our journals and services, and we look forward to partnering with the research community as part of this process.



Repeatability vs. Reproducibility

By RJ Mackenzie


In measuring the quality of experiments, repeatability and reproducibility are key. In this article, we explore the differences between the two terms, and why they are important in determining the worth of published research. 

What is repeatability?

Repeatability is a measure of the likelihood that, having produced one result from an experiment, you can try the same experiment, with the same setup, and produce that exact same result. It’s a way for researchers to verify that their own results are true and are not just chance artifacts.

To demonstrate a technique’s repeatability, the conditions of the experiment must be kept the same. These include: 

  • Measuring tools
  • Other apparatus used in the experiment
  • Time period (taking month-long breaks between repetitions isn’t good practice) 
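A minimal sketch (our own illustration, with hypothetical repeated measurements) of how repeatability is often quantified follows: the standard deviation, or coefficient of variation, of repeats taken with the same setup under constant conditions.

```python
# Minimal sketch (our illustration, hypothetical repeats): repeatability as the
# spread of measurements taken with the same setup under constant conditions.
import statistics

repeated_runs = [9.98, 10.02, 10.01, 9.99, 10.00, 10.03]   # same instrument, operator, and session

mean = statistics.mean(repeated_runs)
repeatability_sd = statistics.stdev(repeated_runs)          # within-setup standard deviation
cv_percent = 100 * repeatability_sd / mean                  # coefficient of variation

print(f"mean = {mean:.3f}, repeatability SD = {repeatability_sd:.3f}, CV = {cv_percent:.2f}%")
```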

Bland and Altman authored an extremely useful paper in 1986 which highlighted one benefit of assessing repeatability: it allows one to make comparisons between different methods of measurement.

Previous studies had used similarities in the correlation coefficient ( r ) between techniques as an indicator of agreement. Bland and Altman showed that r actually measured the strength of the relation between two techniques, not the extent to which they agree with each other. This means r is quite meaningless in this context; if two different techniques were both designed to measure heart rate, it would be bizarre if they weren’t related to each other!
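To illustrate the point, the sketch below (our own simulated example, not data from the 1986 paper) constructs two strongly correlated methods that nonetheless disagree by a constant offset; the correlation coefficient is high even though the Bland–Altman bias and limits of agreement reveal poor agreement.

```python
# Minimal sketch (our simulated example, not data from the 1986 paper): two
# methods can be highly correlated yet disagree by a constant bias, which a
# Bland-Altman analysis (bias and 95% limits of agreement) exposes.
import numpy as np

rng = np.random.default_rng(1)
true_hr = rng.normal(70, 10, size=50)                    # hypothetical "true" heart rates
method_a = true_hr + rng.normal(0, 2, size=50)
method_b = true_hr + 5 + rng.normal(0, 2, size=50)       # correlated with A, but biased by +5 bpm

diff = method_b - method_a
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)                     # half-width of the limits of agreement
r = np.corrcoef(method_a, method_b)[0, 1]

print(f"r = {r:.3f} (high), bias = {bias:.1f} bpm, "
      f"limits of agreement = [{bias - half_width:.1f}, {bias + half_width:.1f}] bpm")
```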

What is reproducibility?

The reproducibility of data is a measure of whether results in a paper can be attained by a different research team, using the same methods. This shows that the results obtained are not artifacts of the unique setup in one research lab. It’s easy to see why reproducibility is desirable, as it reinforces findings and protects against rare cases of fraud, or less rare cases of human error, in the production of significant results.

Why are repeatability and reproducibility important?

Science advances gradually, backed by independent verification and by researchers’ ability to show that their findings are correct and transparently obtained. Academic research findings are only useful to the wider scientific community if the knowledge can be repeated and shared among research groups. As such, irreproducible and unrepeatable studies are the source of much concern within science.

What is the reproducibility crisis?

Over recent decades, science, in particular the social and life sciences, has seen increasing importance placed on the reproducibility of published studies. Large-scale efforts to assess the reproducibility of scientific publications have turned up worrying results. For example, a 2015 paper by a group of psychology researchers dubbed the “Open Science Collaboration” examined 100 experiments published in high-ranking, peer-reviewed journals. Only around a third (36%) of the replication attempts yielded statistically significant results in the same direction as the original findings. These efforts are part of a growing field of “metascience” that aims to take on the reproducibility crisis.

But what about replicability? 

How can we improve reproducibility?

A lot of thought is being put into improving experimental reproducibility. Below are just some of the ways you can improve reproducibility:

  • Journal checklists – more and more journals are coming to understand the importance of including all relevant details in published studies, and are bringing in mandatory checklists for any published papers. The days of leaving out sample numbers and animal model descriptions in methods sections are over and blinding and randomization should be standard where possible.
  • Strict on stats – power calculations, multiple comparisons tests and descriptive statistics are all essential to making sure that reported results are statistically sound (a minimal power-calculation sketch follows this list).
  • Technology can lend a hand – automating processes and using high-throughput systems can improve accuracy in individual experiments, and enable more measurements to be taken in a given time, increasing sample numbers. 
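As an example of the first of these statistical safeguards, an a priori power calculation is sketched below (our own illustration; the effect size, alpha, and power values are assumptions), using the statsmodels library.

```python
# Minimal sketch (our illustration; the effect size, alpha, and power values are
# assumptions) of an a priori power calculation for a two-sample t test.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # expected standardized effect (Cohen's d)
    alpha=0.05,        # two-sided significance level
    power=0.80,        # desired probability of detecting the effect
)
print(f"required sample size per group: {n_per_group:.0f}")   # roughly 64
```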



Published: 26 June 2024

Parallel experiments in electrochemical CO 2 reduction enabled by standardized analytics

  • Alessandro Senocrate   ORCID: orcid.org/0000-0002-0952-0948 1 , 2 ,
  • Francesco Bernasconi   ORCID: orcid.org/0000-0002-6563-0578 1 , 3 ,
  • Peter Kraus   ORCID: orcid.org/0000-0002-4359-5003 1 ,
  • Nukorn Plainpan 1 ,
  • Jens Trafkowski 4 ,
  • Fabian Tolle   ORCID: orcid.org/0000-0002-4221-0167 4 ,
  • Thomas Weber 4 ,
  • Ulrich Sauter   ORCID: orcid.org/0009-0000-7623-8697 1 &
  • Corsin Battaglia 1 , 2 , 3 , 5  

Nature Catalysis, volume 7, pages 742–752 (2024)


  • Analytical chemistry
  • Electrocatalysis

Electrochemical CO 2 reduction (eCO 2 R) is a promising strategy to transform detrimental CO 2 emissions into sustainable fuels and chemicals. Key requirements for advancing this field are the development of analytical systems and of methods that are able to accurately and reproducibly assess the performance of catalysts, electrodes and electrolysers. Here we present a comprehensive analytical system for eCO 2 R based on commercial hardware, which captures data for >20 gas and liquid products with <5 min time resolution by chromatography, tracks gas flow rates, monitors electrolyser temperatures and flow pressures, and records electrolyser resistances and electrode surface areas. To complement the hardware, we develop an open-source software that automatically parses, aligns in time and post-processes the heterogeneous data, yielding quantities such as Faradaic efficiencies and corrected voltages. We showcase the system’s capabilities by performing measurements and data analysis on eight parallel electrolyser cells simultaneously.
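For readers unfamiliar with the key output quantity, the sketch below (our own illustration, not the authors’ open-source dgbowl software) shows the Faradaic-efficiency calculation that such post-processing typically performs for a single gas product; the flow, concentration, and current values are hypothetical.

```python
# Minimal sketch (our illustration, not the authors' dgbowl software) of the
# Faradaic-efficiency calculation such post-processing typically performs for a
# single gas product; flow, concentration, and current values are hypothetical.
F = 96485.332            # Faraday constant, C/mol
R = 8.314462             # molar gas constant, J/(mol K)

def faradaic_efficiency(mole_fraction, flow_sccm, current_A, z, T=273.15, P=101325.0):
    """Fraction of the cell current that went into one gas product.

    mole_fraction  product fraction in the outlet gas (e.g., from gas chromatography)
    flow_sccm      total outlet flow in standard cm^3/min (standard conditions T, P)
    current_A      total cell current in A
    z              electrons transferred per product molecule (e.g., 2 for CO)
    """
    v_dot = flow_sccm * 1e-6 / 60.0               # volumetric flow, m^3/s
    n_dot = mole_fraction * P * v_dot / (R * T)   # molar flow of product, mol/s (ideal gas)
    return z * F * n_dot / current_A

# Example: 1% CO in a 20 sccm outlet stream at 50 mA total current.
print(f"FE(CO) = {faradaic_efficiency(0.01, 20.0, 0.050, z=2):.1%}")
```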




Data availability

Data used in this manuscript are freely available via Zenodo at https://doi.org/10.5281/zenodo.8319625 (ref. 54 ). Data for composing Fig. 3 are also part of an interactive example of automated data parsing and processing that can be accessed via Zenodo at https://doi.org/10.5281/zenodo.7941528 (ref. 32 ).

Code availability

The code used in this work is fully open source and available at https://dgbowl.github.io/ (ref. 33 ).

Delbeke, J., Runge-Metzger, A., Slingenberg, Y. & Werksman, J. in Towards a Climate-Neutral Europe (eds Delbecke, J. & Vis, P.) Ch. 2 (Routledge, 2019).

UNFCCC. 26th UN Climate Change Conference of the Parties 2021 - Glasgow Climate Pact (United Nations, 2022); https://unfccc.int/sites/default/files/resource/cma2021_10_add1_adv.pdf

Chatterjee, T., Boutin, E. & Robert, M. Manifesto for the routine use of NMR for the liquid product analysis of aqueous CO 2 reduction: from comprehensive chemical shift data to formaldehyde quantification in water. Dalton Trans. 49 , 4257–4265 (2020).

Article   CAS   PubMed   Google Scholar  

Zhang, J., Luo, W. & Züttel, A. Crossover of liquid products from electrochemical CO 2 reduction through gas diffusion electrode and anion exchange membrane. J. Catal. 385 , 140–145 (2020).

Article   CAS   Google Scholar  

Lum, Y. & Ager, J. W. Evidence for product-specific active sites on oxide-derived Cu catalysts for electrochemical CO 2 reduction. Nat. Catal. 2 , 86–93 (2019).

Dinh, C. T. et al. CO 2 electroreduction to ethylene via hydroxide-mediated copper catalysis at an abrupt interface. Science 360 , 783–787 (2018).

Birdja, Y. Y. & Vaes, J. Towards a critical evaluation of electrocatalyst stability for CO 2 electroreduction. ChemElectroChem 7 , 4713–4717 (2020).

Birdja, Y. Y. et al. Effects of substrate and polymer encapsulation on CO 2 electroreduction by immobilized indium(III) protoporphyrin. ACS Catal. 8 , 4420–4428 (2018).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Deng, W. et al. Crucial role of surface hydroxyls on the activity and stability in electrochemical CO 2 reduction. J. Am. Chem. Soc. 141 , 2911–2915 (2019).

Choi, Y. W., Scholten, F., Sinev, I. & Cuenya, B. R. Enhanced stability and CO/formate selectivity of plasma-treated SnO x /AgO x catalysts during CO 2 electroreduction. J. Am. Chem. Soc. 141 , 5261–5266 (2019).

Kaneco, S. et al. Electrochemical conversion of carbon dioxide to methane in aqueous NaHCO 3 solution at less than 273 K. Electrochim. Acta 48 , 51–55 (2002).

Varela, A. S. et al. CO 2 electroreduction on well-defined bimetallic surfaces: Cu overlayers on Pt(111) and Pt(211). J. Phys. Chem. C 117 , 20500–20508 (2013).

Ju, W. et al. Understanding activity and selectivity of metal-nitrogen-doped carbon catalysts for electrochemical reduction of CO 2 . Nat. Commun. 8 , 9441 (2017).

Li, A., Wang, H., Han, J. & Liu, L. Preparation of a Pb loaded gas diffusion electrode and its application to CO 2 electroreduction. Front. Chem. Sci. Eng. 6 , 381–388 (2012).

Guzmán, H. et al. Investigation of gas diffusion electrode systems for the electrochemical CO 2 conversion. Catalysts 11 , 482 (2021).

Kortlever, R., Peters, I., Koper, S. & Koper, M. T. M. Electrochemical CO 2 reduction to formic acid at low overpotential and with high Faradaic efficiency on carbon-supported bimetallic Pd–Pt nanoparticles. ACS Catal. 5 , 3916–3923 (2015).

Blom, M. J. W., Smulders, V., van Swaaij, W. P. M., Kersten, S. R. A. & Mul, G. Pulsed electrochemical synthesis of formate using Pb electrodes. Appl. Catal. B Environ. 268 , 118420 (2020).

Larrazábal, G. O. et al. Analysis of mass flows and membrane cross-over in CO 2 reduction at high current densities in an MEA-type electrolyzer. ACS Appl. Mater. Interfaces 11 , 41281–41288 (2019).

Ma, M. et al. Insights into the carbon balance for CO 2 electroreduction on Cu using gas diffusion electrode reactor designs. Energy Environ. Sci. 13 , 977–985 (2020).

Patra, K. K. et al. Boosting electrochemical CO 2 reduction to methane via tuning oxygen vacancy concentration and surface termination on a copper/ceria catalyst. ACS Catal. 12 , 10973–10983 (2022).

An, X. et al. Electrodeposition of tin-based electrocatalysts with different surface tin species distributions for electrochemical reduction of CO 2 to HCOOH. ACS Sustain. Chem. Eng. 7 , 9360–9368 (2019).

Bejtka, K. et al. Chainlike mesoporous SnO 2 as a well-performing catalyst for electrochemical CO 2 reduction. ACS Appl. Energy Mater. 2 , 3081–3091 (2019).

Khanipour, P. et al. Electrochemical real‐time mass spectrometry (EC‐RTMS): monitoring electrochemical reaction products in real time. Angew. Chem. Int. Edn Engl. 131 , 7219–7219 (2019).

Lobaccaro, P. et al. Initial application of selected-ion flow-tube mass spectrometry to real-time product detection in electrochemical CO 2 reduction. Energy Technol. 6 , 110–121 (2018).

Clark, E. L., Singh, M. R., Kwon, Y. & Bell, A. T. Differential electrochemical mass spectrometer cell design for online quantification of products produced during electrochemical reduction of CO 2 . Anal. Chem. 87 , 8013–8020 (2015).

Wang, X. et al. Mechanistic reaction pathways of enhanced ethylene yields during electroreduction of CO 2 –CO co-feeds on Cu and Cu-tandem electrocatalysts. Nat. Nanotechnol. 14 , 1063–1070 (2019).

Zhang, G., Cui, Y. & Kucernak, A. Real-time in situ monitoring of CO 2 electroreduction in the liquid and gas phases by coupled mass spectrometry and localized electrochemistry. ACS Catal. 12 , 6180–6190 (2022).

Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3 , 160018 (2016).

Reinisch, D. et al. Various CO 2 -to-CO electrolyzer cell and operation mode designs to avoid CO 2 -crossover from cathode to anode. Z. Phys. Chem. 234 , 1115–1131 (2019).

Möller, T. et al. The product selectivity zones in gas diffusion electrodes during the electrocatalytic reduction of CO 2 . Energy Environ. Sci. 14 , 5995–6006 (2021).

Kwon, Y. & Koper, M. T. M. Combining voltammetry with HPLC: application to electro-oxidation of glycerol. Anal. Chem. 82 , 5420–5424 (2010).

Senocrate, A., Bernasconi, F., Kraus, P., Sauter, U. & Battaglia, C. Instructions and tutorial for publication 'Parallel experiments in electrochemical CO 2 reduction enabled by standardized analytics'. Zenodo https://doi.org/10.5281/zenodo.7941528 (2024).

dgbowl Development Team. dgbowl: tools for digital (electro-)catalysis and battery materials research. GitHub https://dgbowl.github.io/ (2024).

Kraus, P. & Vetsch, N. yadg: yet another datagram (4.2.3). Zenodo https://doi.org/10.5281/zenodo.7898175 (2023).

Kraus, P. & Sauter, U. dgpost: datagram post-processing toolkit (2.1) https://doi.org/10.5281/zenodo.7898183 (2023).

Kraus, P. et al. Towards automation of operando experiments: a case study in contactless conductivity measurements. Digit. Discov. 1 , 241–254 (2022).

Kraus, P., Vetsch, N. & Battaglia, C. yadg: yet another datagram. J. Open Source Softw. 7 , 4166 (2022).

The pandas development team. pandas-dev/pandas: Pandas (v1.5.3). Zenodo https://doi.org/10.5281/zenodo.7549438 (2023).

Grecco, H. E. & Chéron, J. Pint: Makes Units Easy. GitHub https://github.com/hgrecco/pint (2021).

Lebigot, E. O. Python Uncertainties Package. GitHub https://github.com/lmfit/uncertainties (2022).

dgpost Authors. dgpost: datagram post-processing toolkit—project documentation. GitHub https://dgbowl.github.io/dgpost/master/index.html (2023).

Seger, B., Robert, M. & Jiao, F. Best practices for electrochemical reduction of carbon dioxide. Nat. Sustain. 6 , 236–238 (2023).

Ma, M., Zheng, Z., Yan, W., Hu, C. & Seger, B. Rigorous evaluation of liquid products in high-rate CO 2 /CO electrolysis. ACS Energy Lett. 7 , 2595–2601 (2022).

Kong, Y. et al. Cracks as efficient tools to mitigate flooding in gas diffusion electrodes used for the electrochemical reduction of carbon dioxide. Small Methods 6 , 2200369 (2022).

Wu, Y. et al. Mitigating electrolyte flooding for electrochemical CO 2 reduction via infiltration of hydrophobic particles in a gas diffusion layer. ACS Energy Lett. 7 , 2884–2892 (2022).

Rabinowitz, J. A. & Kanan, M. W. The future of low-temperature carbon dioxide electrolysis depends on solving one basic problem. Nat. Commun. 11 , 5231 (2020).

Ma, M. et al. Local reaction environment for selective electroreduction of carbon monoxide. Energy Environ. Sci. 15 , 2470–2478 (2022).

Friedmann, T. A., Siegal, M. P., Tallant, D. R., Simpson, R. L. & Dominguez, F. Residual stress and Raman spectra of laser deposited highly tetrahedral-coordinated amorphous carbon films. MRS Proc. 349 , 501–506 (1994).

Wheeler, D. G. et al. Quantification of water transport in a CO 2 electrolyzer. Energy Environ. Sci. 13 , 5126–5134 (2020).

DeWulf, D. W., Jin, T. & Bard, A. J. Electrochemical and surface studies of carbon dioxide reduction to methane and ethylene at copper electrodes in aqueous solutions. J. Electrochem. Soc. 136 , 1686–1691 (1989).

Wuttig, A. & Surendranath, Y. Impurity ion complexation enhances carbon dioxide reduction catalysis. ACS Catal. 5 , 4479–4484 (2015).

Senocrate, A. et al. Importance of substrate pore size and wetting behavior in gas diffusion electrodes for CO 2 reduction. ACS Appl. Energy Mater. 5 , 14504–14512 (2022).

Bernasconi, F., Senocrate, A., Kraus, P. & Battaglia, C. Enhancing C≥2 product selectivity in electrochemical CO 2 reduction by controlling the microstructure of gas diffusion electrodes. EES Catal. 1 , 1009–1016 (2023).

Senocrate, A. et al. Dataset for publication 'Parallel experiments in electrochemical CO2 reduction enabled by standardized analytics'. Zenodo https://doi.org/10.5281/zenodo.8319624 (2023).

Acknowledgements

This work has received funding from the ETH Board in the framework of the Joint Strategic Initiative ‘Synthetic Fuels from Renewable Resources’. This work was also supported by the NCCR Catalysis, a National Centre of Competence in Research funded by the Swiss National Science Foundation (grant no. 180544). We further acknowledge support by the Open Research Data Program of the ETH Board (project ‘PREMISE’: Open and Reproducible Materials Science Research). A.S. acknowledges funding from the Swiss National Science Foundation through the Ambizione grant PZ00P2_215992. We thank C. Spitz, S. Holmann and M. Maier from Agilent Technologies (Switzerland) for support with validating the chromatographic method. We thank N. Vetsch for help in coding the electrochemical data parser, and J. Viloria for support during electrochemical experiments. E. Querel is acknowledged for help with the ICP measurements. We also acknowledge the support of the Scientific Center for Optical and Electron Microscopy (ScopeM) of the ETH Zurich and of P. Zeng of ScopeM for the focused ion beam-SEM results. We also thank M. Mirolo of beamline ID31 at the European Synchrotron Radiation Facility (ESRF) for support with the synchrotron X-ray measurements.

Author information

Authors and affiliations

Empa, Swiss Federal Laboratories for Materials Science and Technology, Dübendorf, Switzerland

Alessandro Senocrate, Francesco Bernasconi, Peter Kraus, Nukorn Plainpan, Ulrich Sauter & Corsin Battaglia

ETH Zürich, Department of Information Technology and Electrical Engineering, Zürich, Switzerland

Alessandro Senocrate & Corsin Battaglia

ETH Zürich, Department of Materials, Zürich, Switzerland

Francesco Bernasconi & Corsin Battaglia

Agilent Technologies (Switzerland), Basel, Switzerland

Jens Trafkowski, Fabian Tolle & Thomas Weber

EPFL, School of Engineering, Institute of Materials, Lausanne, Switzerland

Corsin Battaglia

Contributions

A.S. designed, validated and assembled the hardware, performed the main electrochemical experiments and wrote the manuscript with input from all co-authors. F.B. contributed to the electrochemical experiments, validation of the method, assembly of the hardware and acquisition of the SEM images. P.K. wrote the open-source software and contributed to the data analysis. N.P. supported the data analysis effort and wrote the script required to analyse data from parallel cells. J.T., F.T. and T.W. supported the implementation of the online liquid sampling and liquid analysis. U.S. helped write and debug the open-source software. C.B. supervised the development of the project.

Corresponding author

Correspondence to Alessandro Senocrate.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Catalysis thanks Joel Ager III, Zhihao Cui and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Notes 1–10, Figs. 1–21 and Tables 1–11.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Senocrate, A., Bernasconi, F., Kraus, P. et al. Parallel experiments in electrochemical CO2 reduction enabled by standardized analytics. Nat. Catal. 7, 742–752 (2024). https://doi.org/10.1038/s41929-024-01172-x

Received: 11 August 2023
Accepted: 1 May 2024
Published: 26 June 2024
Issue Date: June 2024
DOI: https://doi.org/10.1038/s41929-024-01172-x

Environmental Science: Water Research & Technology

Molecular level seasonality of dissolved organic matter in freshwater and its impact on drinking water treatment

a Department of Thematic Studies – Environmental Change, Linköping University, SE-581 83 Linköping, Sweden E-mail: [email protected]

b Department of Chemistry, State University of New York College of Environmental Science and Forestry, Syracuse, New York 13210, USA

c Research Unit Analytical BioGeoChemistry, Helmholtz Munich, Ingolstaedter Landstraße 1, 85764 Neuherberg, Germany

d Chair of Analytical Food Chemistry, Technical University of Munich, 85354 Freising, Germany

e University of Maryland Center for Environmental Science, Chesapeake Biological Laboratory, Solomons, Maryland 20688, USA

f Research Unit: Environmental Sciences and Management, North-West University, Potchefstroom, South Africa

g Norrvatten, Kvalitet och Utveckling, SE-169 02 Solna, Sweden

h Nodra, Borgs vattenverk, SE-603 36 Norrköping, Sweden

Improved characterization of dissolved organic matter (DOM) in source waters used for drinking water treatment is necessary to optimize treatment processes and obtain high drinking water quality. In this study, seasonal differences in freshwater DOM composition and the associated treatment-induced changes were investigated at four drinking water treatment plants (DWTPs) in Sweden across all seasons of a full year. The objective was to understand how effectively DWTPs can adapt to seasonal changes and to compare how optical and mass spectrometry methods detect these changes. In addition to bulk DOM analysis, this work focused on excitation–emission matrix (EEM) fluorescence including parallel factor (PARAFAC) analysis, and molecular-level non-target analysis by Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS). Overall, seasonal variability of raw water DOM composition was small and explained primarily by changes in the contributions of DOM with aromatic and phenolic moieties, which were more prevalent during spring in two surface water sources, as indicated by absorbance measurements at 254 nm, computed specific ultraviolet absorbance (SUVA) and phenol concentrations. These changes could be balanced by coagulation, resulting in seasonally stable DOM characteristics of the treated water. While EEM fluorescence and PARAFAC modelling effectively revealed DOM fingerprints of the different water sources, FT-ICR MS provided new insights into treatment selectivity on DOM composition at the molecular level. Future DOM monitoring of surface waters should target more specific seasonal DOM changes, such as features with a known impact on certain treatment processes, or target specific events like algal or cyanobacterial blooms.
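
For readers unfamiliar with the optical indicator mentioned above, specific UV absorbance (SUVA254) is the decadic absorbance at 254 nm normalized to a 1 m path length and divided by the DOC concentration; higher values indicate more aromatic DOM. The short sketch below is a hedged illustration with invented numbers, not code from the study.

```python
# Hedged sketch (not the authors' code): computing SUVA254, one of the optical
# indicators of aromatic DOM used in the study.
def suva254(absorbance_254, path_length_cm, doc_mg_per_l):
    """SUVA254 in L mg-C^-1 m^-1.

    absorbance_254 : decadic absorbance at 254 nm (unitless)
    path_length_cm : cuvette path length in cm
    doc_mg_per_l   : dissolved organic carbon concentration in mg C/L
    """
    absorbance_per_m = absorbance_254 / (path_length_cm / 100.0)
    return absorbance_per_m / doc_mg_per_l

# Example: A254 = 0.25 in a 1 cm cuvette with DOC = 8 mg C/L gives SUVA254 ~ 3.1,
# a value typically associated with relatively aromatic, humic-like DOM.
print(f"SUVA254 = {suva254(0.25, 1.0, 8.0):.2f} L mg-C^-1 m^-1")
```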

Supplementary files

  • Supplementary information PDF (4569K)

Article information

Molecular level seasonality of dissolved organic matter in freshwater and its impact on drinking water treatment

A. Andersson, L. Powers, M. Harir, M. Gonsior, N. Hertkorn, P. Schmitt-Kopplin, H. Kylin, D. Hellström, Ä. Pettersson and D. Bastviken, Environ. Sci.: Water Res. Technol., 2024, Advance Article, DOI: 10.1039/D4EW00142G

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.

Infection and Immunity 78(12), December 2010

Reproducible Science

Arturo Casadevall, Editor in Chief, mBio; Departments of Microbiology & Immunology and Medicine, Albert Einstein College of Medicine, Bronx, New York

Ferric C. Fang, Editor in Chief, Infection and Immunity; Departments of Laboratory Medicine and Microbiology, University of Washington School of Medicine, Seattle, Washington

The reproducibility of an experimental result is a fundamental assumption in science. Yet, results that are merely confirmatory of previous findings are given low priority and can be difficult to publish. Furthermore, the complex and chaotic nature of biological systems imposes limitations on the replicability of scientific experiments. This essay explores the importance and limits of reproducibility in scientific manuscripts.

“Non-reproducible single occurrences are of no significance to science.” —Karl Popper (18)

There may be no more important issue for authors and reviewers than the question of reproducibility, a bedrock principle in the conduct and validation of experimental science. Consequently, readers, reviewers, and editors of Infection and Immunity can rightfully expect to see information regarding the reproducibility of experiments in the pages of this journal. Articles may describe findings with a statement that an experiment was repeated a specific number of times, with similar results. Alternatively, depending upon the nature of the experiment, the results from multiple experimental replicates might be presented individually or in combined fashion, along with an indication of experiment-to-experiment variability. For most types of experiment, there is an unstated requirement that the work be reproducible, at least once, in an independent experiment, with a strong preference for reproducibility in at least three experiments. The assumption that experimental findings are reproducible is a key criterion for acceptance of a manuscript, and the Instructions to Authors insist that “the Materials and Methods section should include sufficient technical information to allow the experiments to be repeated.”
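
As a concrete, purely illustrative example of the two reporting styles just described, the sketch below prints replicate values individually and as a combined summary with experiment-to-experiment variability; the readout and numbers are hypothetical.

```python
# Illustrative sketch (not from the editorial): reporting three independent
# experiments individually and as a combined mean with variability.
import numpy as np

# hypothetical readout (e.g., log10 reduction in CFU) from three experiments
experiments = {"exp1": 2.1, "exp2": 2.6, "exp3": 2.3}

values = np.array(list(experiments.values()))
for name, v in experiments.items():
    print(f"{name}: {v:.1f}")                      # per-experiment presentation
print(f"combined: {values.mean():.2f} +/- {values.std(ddof=1):.2f} "
      f"(mean +/- SD, n = {values.size} independent experiments)")
```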

In prior essays, we have explored the adjectives descriptive (6), mechanistic (7), and important (8) as they apply to biology and experimental science in particular. In this essay, we explore the problem of reproducibility in science, with emphasis on the type of science that is routinely reported in Infection and Immunity. In exploring the topic of reproducibility, it is useful to first consider terminology. “Reproducibility” is defined by the Oxford English Dictionary as “the extent to which consistent results are obtained when produced repeatedly.” Although it is taken for granted that scientific experiments should be reproducible, it is worth remembering that irreproducible one-time events can still be a tremendously important source of scientific information. This is particularly true for observational sciences in which inferences are made from events and processes not under an observer's control. For example, the collision of comet Shoemaker-Levy 9 with Jupiter in July 1994 provided a bonanza of information on Jovian atmospheric dynamics and prima facie evidence for the threat of meteorite and comet impacts. Consequently, the criterion of reproducibility is not an essential requirement for the value of scientific information, at least in some fields. Scientists studying the evolution of life on earth must contend with their inability to repeat that magnificent experiment. Gould famously observed that if one were to “rewind the tape of life,” the results would undoubtedly be different, with the likely outcome that nothing resembling ourselves would exist (12). (Note for younger scientists: it used to be fashionable to record sounds and images on metal oxide-coated tape and play them back on devices called “tape players.”) This is supported by the importance of stochastic and contingent events in experimental evolutionary systems (4).

Given the requirement for reproducibility in experimental science, we face two apparent contradictions. First, published science is expected to be reproducible, yet most scientists are not interested in replicating published experiments or reading about them. Many reputable journals, including Infection and Immunity, are unlikely to accept manuscripts that precisely replicate published findings, despite the explicit requirement that experimental protocols must be reported in sufficient detail to allow repetition. This leads to a second paradox: published science is assumed to be reproducible, yet only rarely is the reproducibility of such work tested or known. In fact, the emphasis on reproducing experimental results becomes important only when work becomes controversial or is called into doubt. Replication can even be hazardous. The German scientist Georg Wilhelm Richmann was fatally electrocuted during an attempt to reproduce Ben Franklin's famous experiment with lightning (1). The assumption that science must be reproducible is implicit yet seldom tested, and in many systems the true reproducibility of experimental data is unknown or has not been rigorously investigated in a systematic fashion. Hence, the solidity of this bedrock assumption of experimental science lies largely in the realm of belief and trust in the integrity of the authors.

Reproducibility versus replicability.

Although many biological scientists intuitively believe that the reproducibility of an experiment means that it can be replicated, Drummond makes a distinction between these two terms (9). Drummond argues that reproducibility requires changes, whereas replicability avoids them (9). In other words, reproducibility refers to a phenomenon that can be predicted to recur even when experimental conditions may vary to some degree. On the other hand, replicability describes the ability to obtain an identical result when an experiment is performed under precisely identical conditions. For biological scientists, this would appear to be an important distinction with everyday implications. For example, consider a lab attempting to reproduce another lab's finding that a certain bacterial gene confers a certain phenotype. Such an experiment might involve making gene-deficient variants, observing the effects of gene deletion on the phenotype, and, if phenotypic changes are apparent, then going further to show that gene complementation restores the original phenotype. Given the high likelihood of microevolution in microbial strains and the possibility that independently synthesized gene disruption and replacement cassettes may have subtly different effects, the attempt to reproduce findings does not necessarily involve a precise replication of the original experiment. Nevertheless, if the results from both laboratories are concordant, then the experiment is considered to be successfully reproduced, despite the fact that, according to Drummond's distinction, it was never replicated. On the other hand, if the results differ, a myriad of possible explanations must be considered, some of which relate to differences in experimental protocols. Hence, it would seem that scientists are generally interested in the reproducibility of results rather than the precise replication of experimental results. Some variation of conditions is considered desirable because obtaining the same result without absolutely faithful replication of the experimental conditions implies a certain robustness of the original finding. In this example, the replicability of the original experiment following the exact protocols initially reported would be important only if all subsequent attempts to reproduce the result were unsuccessful. When findings are so dependent on precise experimental conditions that replicability is needed for reproducibility, the result may be idiosyncratic and less important than a phenomenon that can be reproduced by a variety of independent, nonidentical approaches.

Replicability requirement for individual studies.

Given the difference between reproducibility and replicability that depends on whether experimental conditions are subject to variation, it is apparent that when most papers state that data are reproducible, they actually mean that the experiment has been replicated. On the other hand, when different laboratories report the confirmation of a phenomenon, it is likely that this reflects reproducibility, since experimental variability between labs is likely to result in some variable(s) being changed. In fact, depending on the number of variables involved, replicability may be achievable only in the original laboratory and possibly by the same experimenter. This accounts for the greater confidence one has in a scientific observation that has been corroborated by independent observers.

The desirability of replicability in experimental science leads to the practical question of how many times an experiment should be replicated before publication. Most reviewers would demand at least one replication, while preferring more. In this situation, the replicability of an experiment provides assurance that the effect is not due to chance alone or an experimental artifact resulting in a one-time event. Ideally, an experiment should be repeated multiple times before it is reported, with the caveat that for some experiments the expense of this approach may be prohibitive. Guidelines for experimentation with vertebrate animals also discourage the use of unnecessary duplication ( 10 , 17 ). In fact, some institutions may explicitly prohibit the practice of repeating animal experiments that reproduce published results. We agree with the need to repeat experiments but suggest that authors strive for reproducibility instead of simple replicability. For example, consider an experiment in which a particular variable, the level of a specific antibody, is believed to account for a specific experimental outcome, resistance to a microbial pathogen. Passive administration of the immunoglobulin can be used to provide protection and support the hypothesis. Rather than simply replicating this experiment, the investigator might more fruitfully conduct a dose-response experiment to determine the effect of various antibody doses or microbial inocula and test multiple strains rather than simply carrying out multiple replicates of the original experiment.

Limits of replicability and reproducibility.

Although the ability of an investigator to confirm an experimental result is essential to good science, with an inherent assumption of reproducibility, we note that there are practical and philosophical limits to the replicability and reproducibility of findings. Although to our knowledge this question has not been formally studied, replicability is likely to be inversely proportional to the number of variables in an experiment. This is all too apparent in clinical studies, leading Ioannidis to conclude that most published research findings are false (13). Statistical analysis and meta-analysis would not be required if biological experiments were precisely replicable. Initial results from genetic association studies are frequently unconfirmed by follow-up analyses (14), clinical trials based on promising preclinical studies frequently fail (16), and a recent paper reported that only a minority of published microarray results could be repeated (15). Such observations have even led some to question the validity of the requirement for replication in science (21).
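
A minimal sketch of the arithmetic behind Ioannidis's argument (our illustration, not taken from the editorial): when the prior plausibility of the hypotheses being tested is low, even well-powered studies at alpha = 0.05 yield mostly false positives.

```python
# Illustrative calculation of the post-study probability that a "significant"
# finding is true, as a function of prior plausibility, alpha and power
# (bias and multiple testing ignored for simplicity).
def positive_predictive_value(prior, alpha=0.05, power=0.8):
    """P(hypothesis true | significant result)."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    print(f"prior = {prior:>4}: PPV = {positive_predictive_value(prior):.2f}")
# With only 1% of tested hypotheses true, most "positive" results are false
# even at alpha = 0.05 and 80% power.
```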

Every variable contains a certain degree of error. Since error propagates linearly or nonlinearly depending on the system, one may conclude that the more variables involved, the more errors can be expected, thus reducing the replicability of an experiment. Scientists may attempt to control variables in order to achieve greater reproducibility but must remember that as they do so, they may progressively depart from the heterogeneity of real life. In our hypothetical experiment relating specific antibody to host resistance, errors in antibody concentration, inoculum, and consistency of delivery can conspire to produce different outcomes with each replication attempt. Although these errors may be minimized by good experimental technique, they cannot be eliminated entirely. There are other sources of variation in the experiment that are more difficult to control. For example, mouse groups may differ, despite being matched by genetics, supplier, gender, and age, in such intangible areas as nutrition, stress, circadian rhythm, etc. Similarly, it is very difficult to prepare infectious inocula on different days that closely mirror one another given all the variables that contribute to microbial growth and virulence. To further complicate matters, the outcomes of complex processes such as infection and the host response do not often manifest simple dose-response relationships. Inherent stochasticity in biological processes ( 19 ) and anatomic or functional bottlenecks ( 2 ) provide additional sources of experiment-to-experiment variability. For many experiments reported in Infection and Immunity , the outcome of the experiment is highly dependent on initial experimental conditions, and small variations in the initial variables can lead to chaotic results. In such systems where exact replicability is difficult or impossible to achieve, the goal should be general reproducibility of the overall results. Ironically, results that are replicated too precisely are “too good to be true” and raise suspicions of data falsification ( 3 ), illustrating the tacit recognition that biological results inherently exhibit a degree of variation.
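
The point about error accumulating across variables can be made vivid with a small simulation (our own sketch, with arbitrary numbers): give each of several multiplicative factors a 10% measurement error and watch the spread of the final readout grow with the number of variables.

```python
# Hypothetical simulation (not from the essay): how measurement error in each
# of several multiplicative variables compounds, reducing the run-to-run
# replicability of the final readout.
import numpy as np

rng = np.random.default_rng(0)

def readout_cv(n_variables, relative_error=0.10, n_runs=10_000):
    """Coefficient of variation of a readout that is the product of
    n_variables factors, each measured with 10% relative error."""
    factors = rng.normal(1.0, relative_error, size=(n_runs, n_variables))
    readout = factors.prod(axis=1)
    return readout.std() / readout.mean()

for k in (1, 3, 6, 10):
    print(f"{k:>2} variables -> CV of readout ~ {readout_cv(k):.1%}")
# More variables, more spread: the same protocol yields noisier outcomes.
```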

To continue the example given above, the conclusion that antibody was protective may be reproduced in subsequent experiments despite the fact that the precise initial result on average survival was never replicated, in the sense that subsequent experiments varied in magnitude of difference observed and time to death for the various groups. Investigators may be able to increase the likelihood that individual experiments are reproducible by enhancing their robustness. A well-known strategy to enhance the likelihood of reproducibility is to increase the power of the experiment by increasing the number of individual measurements, in order to minimize the contribution of errors or random effects. For example, using 10 mice per group in the aforementioned experiment is more likely to lead to reproducible results than using 3 mice, other things being equal. Along the same lines, two experiments using 10 mice each will provide more confidence in the robustness of the results than will a single experiment involving 20 animals, because obtaining similar results on different days lessens the likelihood that a given result was strongly influenced by an unrecognized variable on the particular day of the experiment. When reviewers criticize low power in experimental design, they are essentially worried that the effect of variable uncertainty on low numbers of measurements will adversely influence the reproducibility of the findings. However, subjective judgments based on conflicting values can influence the determination of sample size. For instance, investigators and reviewers are more likely to accept smaller sample sizes in experiments using primates. Consequently, a sample size of 3 might be acceptable in an experiment using chimpanzees while the same sample size might be regarded as unacceptable in a mouse experiment, even if the results in both cases achieve statistical significance. Similarly, cost can be a mitigating factor in determining the minimum number of replicates. For nucleic acid chip hybridization experiments, measurements in triplicate are recommended despite the complexity of such experiments and the range of variation inherent in such measurement, a recommendation that tacitly accepts the prohibitive cost of larger numbers of replicates for most investigators ( 5 ). Cost is also a major consideration in replicating transgenic or knockout mouse experiments in which mouse construction may take years. Hence, the power of an experiment can be estimated accurately using statistics, but real-life considerations ranging from the ethics of animal experimentation to monetary expense can influence investigator and reviewer judgment.
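
The mouse example can also be made quantitative with a simple simulation (again a hedged sketch, not from the essay): assuming a true effect of one standard deviation, we estimate how often an independent repeat of the experiment reaches p < 0.05 at different group sizes.

```python
# Hedged illustration: why 10 mice per group replicate more reliably than 3,
# for the same true effect. We simulate a one-standard-deviation difference in
# a readout and count how often a repeat experiment is significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def replication_rate(n_per_group, effect_sd=1.0, n_experiments=5_000):
    hits = 0
    for _ in range(n_experiments):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_sd, 1.0, n_per_group)
        _, p = stats.ttest_ind(treated, control)
        hits += p < 0.05
    return hits / n_experiments

for n in (3, 10, 20):
    print(f"n = {n:>2} per group -> p < 0.05 in ~{replication_rate(n):.0%} of repeats")
```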

We cannot leave the subject of scientific reproducibility without acknowledging that questions about replicability and reproducibility have long been at the heart of philosophical debates about the nature of science and the line of demarcation between science and non-science. While scientists and reviewers demand evidence for the reproducibility of scientific findings, philosophers of science have largely discarded the view that scientific knowledge should meet the criterion that it is verifiable. Through inductive reasoning, Bacon used data to infer that under similar circumstances a result will be repeated and can be used to make generalizations about other related situations ( 11 ). However, the logical consistency of such views was challenged by Hume, who posited that inferences from experiences (or, in our case, experiments) cannot be assumed to hold in the future because the future may not necessarily be like the past. In other words, even the daily rising of the sun for millennia does not provide absolute assurance that it will rise the next day. The philosophies of logical positivism and verificationism viewed truth as reflecting the reproducibility of empirical experience, dependent on propositions that could be proven to be true or false. This was challenged by Popper, who suggested that a hypothesis could not be proven, only falsified or not, leaving open the possibility of a rare predictable exception, vividly depicted as the metaphor of a “black swan” ( 20 ). One million sightings of white swans cannot prove the hypothesis that all swans are white, but the hypothesis can be falsified by the sight of a single black swan.

A pragmatic approach to reproducibility.

Given the challenges of achieving and defining replicability and reproducibility in experimental science, what practical guidance can we provide? Despite valid concerns ranging from the true reproducibility of experimental science to the logical inconsistencies identified by philosophers of science, experimental reproducibility remains a standard and accepted criterion for publication. Hence, investigators must strive to obtain information with regard to the reproducibility of their results. That, in turn, raises the question of the number of replications needed for acceptance by the scientific community. The number of times that an experiment is performed should be clearly stated in a manuscript. A new finding should be reproduced at least once and preferably more times. However, even here there is some room for judgment under exceptional circumstances. Consider a trial of a new therapeutic molecule that is expected to produce a certain result in a primate experiment based on known cellular processes. If one were to obtain precisely the predicted result, one might present a compelling argument for accepting the results of the single experiment on moral grounds regarding animal experimentation, especially in situations in which the experiment results in injury or death to the animal. At the other extreme, when an experiment is easily and inexpensively carried out without ethical considerations, then it behooves the investigator to ascertain the replicability and reproducibility of a result as fully as possible. However, there are no hard and fast rules for the number of times that an experiment should be replicated before a manuscript is considered acceptable for publication. In general, the importance of reproducibility increases in proportion to the importance of a result, and experiments that challenge existing beliefs and assumptions will be subjected to greater scrutiny than those fitting within established paradigms.

Given that most experimental results reported in the literature will not be subjected to the test of precise replication unless the results are challenged, it is essential for investigators to make their utmost efforts to place only the most robust data into print, and this almost always involves a careful assessment of the variability inherent in a particular experimental protocol and the provision of information regarding the replicability of the results. In this instance, more is better than less. To ensure that research findings are robust, it is particularly desirable to demonstrate their reproducibility in the face of variations in experimental conditions. Reproducibility remains central to science, even as we recognize the limits of our ability to achieve absolute predictability in the natural world. Then again, ask us next week and you might get a different answer.

The views expressed in this Editorial do not necessarily reflect the views of the journal or of ASM.

Editor: A. Camilli

Published ahead of print on 27 September 2010.
