Experimental Design: Types, Examples & Methods

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD, is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.


Experimental design refers to how participants are allocated to different groups in an experiment. Types of design include repeated measures, independent groups, and matched pairs designs.

Probably the most common way to design an experiment in psychology is to divide the participants into two groups, the experimental group and the control group, and then introduce a change to the experimental group but not to the control group.

The researcher must decide how they will allocate their sample to the different experimental groups. For example, if there are 10 participants, will all 10 take part in both conditions (e.g., repeated measures), or will they be split in half, with each half taking part in only one condition?

Three types of experimental designs are commonly used:

1. Independent Measures Design

Independent measures design, also known as between-groups design, is an experimental design where different participants are used in each condition of the independent variable. This means that each condition of the experiment includes a different group of participants.

This should be done by random allocation, ensuring that each participant has an equal chance of being assigned to either group.

Independent measures design involves using two separate groups of participants, one in each condition.


  • Con: More people are needed than with the repeated measures design (i.e., it is more time-consuming).
  • Pro: Avoids order effects (such as practice or fatigue) because people participate in one condition only. If a person took part in several conditions, they might become bored, tired, or fed up by the time they reached the second condition, or become wise to the requirements of the experiment.
  • Con: Differences between participants in the groups may affect results, for example, variations in age, gender, or social background. These differences are known as participant variables (i.e., a type of extraneous variable).
  • Control: After the participants have been recruited, they should be randomly assigned to their groups. This should ensure the groups are similar, on average, reducing participant variables (see the sketch below).
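To make random allocation concrete, here is a minimal Python sketch of splitting a recruited sample into two independent groups. The participant IDs and group labels are invented for illustration only.

```python
import random

# Hypothetical participant IDs; in practice these come from the recruited sample.
participants = [f"P{i:02d}" for i in range(1, 11)]

random.shuffle(participants)                  # randomize the order of participants
midpoint = len(participants) // 2
experimental_group = participants[:midpoint]  # one condition of the IV
control_group = participants[midpoint:]       # the other condition

print("Experimental:", experimental_group)
print("Control:     ", control_group)
```

Because every ordering of the shuffled list is equally likely, each participant has the same chance of ending up in either group, which is the point of random allocation.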

2. Repeated Measures Design

Repeated measures design is an experimental design where the same participants take part in each condition of the independent variable. This means that each condition of the experiment includes the same group of participants.

Repeated measures design is also known as a within-groups or within-subjects design.

  • Pro: As the same participants are used in each condition, participant variables (i.e., individual differences) are reduced.
  • Con: There may be order effects. Order effects refer to the order of the conditions affecting the participants’ behavior. Performance in the second condition may be better because the participants know what to do (i.e., a practice effect), or it may be worse because they are tired (i.e., a fatigue effect). This limitation can be controlled using counterbalancing.
  • Pro: Fewer people are needed because they take part in all conditions (i.e., it saves time).
  • Control: To combat order effects, the researcher counterbalances the order of the conditions across participants, alternating the order in which participants complete the different conditions of the experiment.

Counterbalancing

Suppose we used a repeated measures design in which all of the participants first learned words in “loud noise” and then learned them in “no noise.”

We expect the participants to learn better in “no noise” because of order effects, such as practice. However, a researcher can control for order effects using counterbalancing.

The sample would be split into two groups, with the conditions labeled A (“loud noise”) and B (“no noise”). Group 1 does A then B, and group 2 does B then A. This eliminates order effects.

Although order effects occur for each participant, they balance each other out in the results because they occur equally in both groups.
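A counterbalancing schedule is easy to generate. The sketch below is a minimal illustration using invented participant IDs; half the sample gets the A-then-B order and half gets B-then-A.

```python
import random

# Hypothetical participant IDs and the two condition orders used for counterbalancing.
participants = [f"P{i:02d}" for i in range(1, 11)]
orders = [("loud noise", "no noise"),   # order A then B
          ("no noise", "loud noise")]   # order B then A

random.shuffle(participants)
half = len(participants) // 2
schedule = {p: orders[0] for p in participants[:half]}
schedule.update({p: orders[1] for p in participants[half:]})

for participant, order in sorted(schedule.items()):
    print(participant, "->", " then ".join(order))
```

Any practice or fatigue effect now falls equally on both conditions, so it cancels out when the two conditions are compared.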


3. Matched Pairs Design

A matched pairs design is an experimental design where pairs of participants are matched in terms of key variables, such as age or socioeconomic status. One member of each pair is then placed into the experimental group and the other member into the control group.

One member of each matched pair must be randomly assigned to the experimental group and the other to the control group.
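The sketch below shows one way matching and assignment might be implemented, assuming participants are matched on a single pre-test score; the names, scores, and the use of a pre-test are hypothetical choices made purely for illustration.

```python
import random

# Hypothetical pre-test scores on the matching variable (e.g., a symptom-severity scale).
scores = {"P01": 12, "P02": 30, "P03": 14, "P04": 29, "P05": 21, "P06": 22}

# Rank participants by the matching variable, then pair adjacent participants.
ranked = sorted(scores, key=scores.get)
pairs = [ranked[i:i + 2] for i in range(0, len(ranked), 2)]  # assumes an even sample size

experimental, control = [], []
for pair in pairs:
    random.shuffle(pair)            # random assignment within each matched pair
    experimental.append(pair[0])
    control.append(pair[1])

print("Experimental:", experimental)
print("Control:     ", control)
```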


  • Con: If one participant drops out, you lose the data of two participants.
  • Pro: Reduces participant variables because the researcher has tried to pair up the participants so that each condition has people with similar abilities and characteristics.
  • Con: It is very time-consuming to find closely matched pairs.
  • Pro: It avoids order effects, so counterbalancing is not necessary.
  • Con: It is impossible to match people exactly unless they are identical twins.
  • Control: Members of each pair should be randomly assigned to conditions. However, this does not solve all of these problems.

Experimental design refers to how participants are allocated to an experiment’s different conditions (or IV levels). There are three types:

1. Independent measures / between-groups: Different participants are used in each condition of the independent variable.

2. Repeated measures / within-groups: The same participants take part in each condition of the independent variable.

3. Matched pairs: Each condition uses different participants, but they are matched in terms of important characteristics, e.g., gender, age, and intelligence.

Learning Check

Read about each of the experiments below. For each experiment, identify (1) which experimental design was used; and (2) why the researcher might have used that design.

1. To compare the effectiveness of two different types of therapy for depression, depressed patients were assigned to receive either cognitive therapy or behavior therapy for a 12-week period.

The researchers attempted to ensure that the patients in the two groups had similar severity of depressed symptoms by administering a standardized test of depression to each participant, then pairing them according to the severity of their symptoms.

2. To assess the difference in reading comprehension between 7- and 9-year-olds, a researcher recruited a group of each age from a local primary school. They were given the same passage of text to read and then asked a series of questions to assess their understanding.

3. To assess the effectiveness of two different ways of teaching reading, a group of 5-year-olds was recruited from a primary school. Their level of reading ability was assessed, and then they were taught using scheme one for 20 weeks.

At the end of this period, their reading was reassessed, and a reading improvement score was calculated. They were then taught using scheme two for a further 20 weeks, and another reading improvement score for this period was calculated. The reading improvement scores for each child were then compared.

4. To assess the effect of organization on recall, a researcher randomly assigned student volunteers to two conditions.

Condition one attempted to recall a list of words that were organized into meaningful categories; condition two attempted to recall the same words, randomly grouped on the page.

Experiment Terminology

Ecological validity

The degree to which an investigation represents real-life experiences.

Experimenter effects

These are the ways that the experimenter can accidentally influence the participant through their appearance or behavior.

Demand characteristics

The clues in an experiment that lead the participants to think they know what the researcher is looking for (e.g., the experimenter’s body language).

Independent variable (IV)

The variable the experimenter manipulates (i.e., changes), which is assumed to have a direct effect on the dependent variable.

Dependent variable (DV)

The variable the experimenter measures. This is the outcome (i.e., the result) of a study.

Extraneous variables (EV)

All variables which are not independent variables but could affect the results (DV) of the experiment. Extraneous variables should be controlled where possible.

Confounding variables

Variable(s) that have affected the results (DV), apart from the IV. A confounding variable could be an extraneous variable that has not been controlled.

Random Allocation

Randomly allocating participants to independent variable conditions means that all participants should have an equal chance of taking part in each condition.

The principle of random allocation is to avoid bias in how the experiment is carried out and limit the effects of participant variables.

Order effects

Changes in participants’ performance due to their repeating the same or similar test more than once. Examples of order effects include:

(i) practice effect: an improvement in performance on a task due to repetition, for example, because of familiarity with the task;

(ii) fatigue effect: a decrease in performance of a task due to repetition, for example, because of boredom or tiredness.


Chapter 6: Experimental Research

In the late 1960s social psychologists John Darley and Bibb Latané proposed a counterintuitive hypothesis. The more witnesses there are to an accident or a crime, the less likely any of them is to help the victim (Darley & Latané, 1968) [1] .

 They also suggested the theory that this phenomenon occurs because each witness feels less responsible for helping—a process referred to as the “diffusion of responsibility.” Darley and Latané noted that their ideas were consistent with many real-world cases. For example, a New York woman named Catherine “Kitty” Genovese was assaulted and murdered while several witnesses evidently failed to help. But Darley and Latané also understood that such isolated cases did not provide convincing evidence for their hypothesized “bystander effect.” There was no way to know, for example, whether any of the witnesses to Kitty Genovese’s murder would have helped had there been fewer of them.

So to test their hypothesis, Darley and Latané created a simulated emergency situation in a laboratory. Each of their university student participants was isolated in a small room and told that he or she would be having a discussion about university life with other students via an intercom system. Early in the discussion, however, one of the students began having what seemed to be an epileptic seizure. Over the intercom came the following: “I could really-er-use some help so if somebody would-er-give me a little h-help-uh-er-er-er-er-er c-could somebody-er-er-help-er-uh-uh-uh (choking sounds)…I’m gonna die-er-er-I’m…gonna die-er-help-er-er-seizure-er- [chokes, then quiet]” (Darley & Latané, 1968, p. 379) [2] .

In actuality, there were no other students. These comments had been prerecorded and were played back to create the appearance of a real emergency. The key to the study was that some participants were told that the discussion involved only one other student (the victim), others were told that it involved two other students, and still others were told that it included five other students. Because this was the only difference between these three groups of participants, any difference in their tendency to help the victim would have to have been caused by it. And sure enough, the likelihood that the participant left the room to seek help for the “victim” decreased from 85% to 62% to 31% as the number of “witnesses” increased.

The Parable of the 38 Witnesses

The story of Kitty Genovese has been told and retold in numerous psychology textbooks. The standard version is that there were 38 witnesses to the crime, that all of them watched (or listened) for an extended period of time, and that none of them did anything to help. However, recent scholarship suggests that the standard story is inaccurate in many ways (Manning, Levine, & Collins, 2007) [3] . For example, only six eyewitnesses testified at the trial, none of them was aware that he or she was witnessing a lethal assault, and there have been several reports of witnesses calling the police or even coming to the aid of Kitty Genovese. Although the standard story inspired a long line of research on the bystander effect and the diffusion of responsibility, it may also have directed researchers’ and students’ attention away from other equally interesting and important issues in the psychology of helping—including the conditions in which people do in fact respond collectively to emergency situations.

The research that Darley and Latané conducted was a particular kind of study called an experiment. Experiments are used to determine not only whether there is a meaningful relationship between two variables but also whether the relationship is a causal one that is supported by statistical analysis. For this reason, experiments are one of the most common and useful tools in the psychological researcher’s toolbox. In this chapter, we look at experiments in detail. We will first consider what sets experiments apart from other kinds of studies and why they support causal conclusions while other kinds of studies do not. We then look at two basic ways of designing an experiment—between-subjects designs and within-subjects designs—and discuss their pros and cons. Finally, we consider several important practical issues that arise when conducting experiments.

  • Darley, J. M., & Latané, B. (1968). Bystander intervention in emergencies: Diffusion of responsibility. Journal of Personality and Social Psychology, 4, 377–383. ↵
  • Manning, R., Levine, M., & Collins, A. (2007). The Kitty Genovese murder and the social psychology of helping: The parable of the 38 witnesses. American Psychologist, 62, 555–562. ↵

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Introduction to Research Methods in Psychology

Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."


Emily is a board-certified science editor who has worked with top digital publishing brands like Voices for Biodiversity, Study.com, GoodTherapy, Vox, and Verywell.


There are several different research methods in psychology , each of which can help researchers learn more about the way people think, feel, and behave. If you're a psychology student or just want to know the types of research in psychology, here are the main ones as well as how they work.

Three Main Types of Research in Psychology


Psychology research can usually be classified as one of three major types.

1. Causal or Experimental Research

When most people think of scientific experimentation, research on cause and effect is most often brought to mind. Experiments on causal relationships investigate the effect of one or more variables on one or more outcome variables. This type of research also determines if one variable causes another variable to occur or change.

An example of this type of research in psychology would be changing the length of a specific mental health treatment and measuring the effect on study participants.

2. Descriptive Research

Descriptive research seeks to depict what already exists in a group or population. Types of psychology research utilizing this method include:

  • Case studies
  • Observational studies

An example of this psychology research method would be an opinion poll to determine which presidential candidate people plan to vote for in the next election. Descriptive studies don't try to measure the effect of a variable; they seek only to describe it.

3. Relational or Correlational Research

A study that investigates the connection between two or more variables is considered relational research. The variables compared are generally already present in the group or population.

For example, a study that looks at the proportion of males and females that would purchase either a classical CD or a jazz CD would be studying the relationship between gender and music preference.

Theory vs. Hypothesis in Psychology Research

People often confuse the terms theory and hypothesis or are not quite sure of the distinctions between the two concepts. If you're a psychology student, it's essential to understand what each term means, how they differ, and how they're used in psychology research.

A theory is a well-established principle that has been developed to explain some aspect of the natural world. A theory arises from repeated observation and testing and incorporates facts, laws, predictions, and tested hypotheses that are widely accepted.

A hypothesis is a specific, testable prediction about what you expect to happen in your study. For example, an experiment designed to look at the relationship between study habits and test anxiety might have a hypothesis that states, "We predict that students with better study habits will suffer less test anxiety." Unless your study is exploratory in nature, your hypothesis should always explain what you expect to happen during the course of your experiment or research.

While the terms are sometimes used interchangeably in everyday use, the difference between a theory and a hypothesis is important when studying experimental design.

Some other important distinctions to note include:

  • A theory predicts events in general terms, while a hypothesis makes a specific prediction about a specified set of circumstances.
  • A theory has been extensively tested and is generally accepted, while a hypothesis is a speculative guess that has yet to be tested.

The Effect of Time on Research Methods in Psychology

There are two types of time dimensions that can be used in designing a research study:

  • Cross-sectional research takes place at a single point in time. All tests, measures, or variables are administered to participants on one occasion. This type of research seeks to gather data on present conditions instead of looking at the effects of a variable over a period of time.
  • Longitudinal research is a study that takes place over a period of time. Data is first collected at the beginning of the study, and may then be gathered repeatedly throughout the length of the study. Some longitudinal studies may occur over a short period of time, such as a few days, while others may take place over a period of months, years, or even decades.

The effects of aging are often investigated using longitudinal research.

Causal Relationships Between Psychology Research Variables

What do we mean when we talk about a “relationship” between variables? In psychological research, we're referring to a connection between two or more factors that we can measure or systematically vary.

One of the most important distinctions to make when discussing the relationship between variables is the meaning of causation.

A causal relationship is when one variable causes a change in another variable. These types of relationships are investigated by experimental research to determine if changes in one variable actually result in changes in another variable.

Correlational Relationships Between Psychology Research Variables

A correlation is the measurement of the relationship between two variables. These variables already occur in the group or population and are not controlled by the experimenter.

  • A positive correlation is a direct relationship where, as the amount of one variable increases, the amount of a second variable also increases.
  • In a negative correlation , as the amount of one variable goes up, the levels of another variable go down.

In both types of correlation, there is no evidence or proof that changes in one variable cause changes in the other variable. A correlation simply indicates that there is a relationship between the two variables.

The most important concept is that correlation does not equal causation. Many popular media sources make the mistake of assuming that simply because two variables are related, a causal relationship exists.
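As a simple numerical illustration, the sketch below computes a Pearson correlation for two made-up variables; the data and variable names are invented placeholders, not results from any study.

```python
from statistics import mean

# Invented data illustrating a positive relationship between two measured variables.
hours_exercised = [1, 2, 3, 4, 5, 6]
mood_rating     = [4, 5, 5, 7, 8, 9]

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(f"r = {pearson_r(hours_exercised, mood_rating):.2f}")
# A value near +1 indicates a strong positive correlation; it does not show that
# exercising causes better mood, only that the two variables move together.
```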


Chapter 6: Experimental Research

Experiment Basics

Learning Objectives

  • Explain what an experiment is and recognize examples of studies that are experiments and studies that are not experiments.
  • Explain what internal validity is and why experiments are considered to be high in internal validity.
  • Explain what external validity is and evaluate studies in terms of their external validity.
  • Distinguish between the manipulation of the independent variable and control of extraneous variables and explain the importance of each.
  • Recognize examples of confounding variables and explain how they affect the internal validity of a study.

What Is an Experiment?

As we saw earlier in the book, an  experiment  is a type of study designed specifically to answer the question of whether there is a causal relationship between two variables. In other words, whether changes in an independent variable  cause  changes in a dependent variable. Experiments have two fundamental features. The first is that the researchers manipulate, or systematically vary, the level of the independent variable. The different levels of the independent variable are called conditions . For example, in Darley and Latané’s experiment, the independent variable was the number of witnesses that participants believed to be present. The researchers manipulated this independent variable by telling participants that there were either one, two, or five other students involved in the discussion, thereby creating three conditions. For a new researcher, it is easy to confuse  these terms by believing there are three independent variables in this situation: one, two, or five students involved in the discussion, but there is actually only one independent variable (number of witnesses) with three different conditions (one, two or five students). The second fundamental feature of an experiment is that the researcher controls, or minimizes the variability in, variables other than the independent and dependent variable. These other variables are called extraneous variables . Darley and Latané tested all their participants in the same room, exposed them to the same emergency situation, and so on. They also randomly assigned their participants to conditions so that the three groups would be similar to each other to begin with. Notice that although the words  manipulation  and  control  have similar meanings in everyday language, researchers make a clear distinction between them. They manipulate  the independent variable by systematically changing its levels and control  other variables by holding them constant.

Four Big Validities

When we read about psychology experiments with a critical view, one question to ask is “is this study valid?” However, that question is not as straightforward as it seems because in psychology, there are many different kinds of validities. Researchers have focused on four validities to help assess whether an experiment is sound (Judd & Kenny, 1981; Morling, 2014) [1] [2] : internal validity, external validity, construct validity, and statistical validity. We will explore each validity in depth.

Internal Validity

Recall that two variables being statistically related does not necessarily mean that one causes the other. “Correlation does not imply causation.” For example, if it were the case that people who exercise regularly are happier than people who do not exercise regularly, this implication would not necessarily mean that exercising increases people’s happiness. It could mean instead that greater happiness causes people to exercise (the directionality problem) or that something like better physical health causes people to exercise   and  be happier (the third-variable problem).

The purpose of an experiment, however, is to show that two variables are statistically related and to do so in a way that supports the conclusion that the independent variable caused any observed differences in the dependent variable. The logic is based on this assumption : If the researcher creates two or more highly similar conditions and then manipulates the independent variable to produce just  one  difference between them, then any later difference between the conditions must have been caused by the independent variable. For example, because the only difference between Darley and Latané’s conditions was the number of students that participants believed to be involved in the discussion, this difference in belief must have been responsible for differences in helping between the conditions.

An empirical study is said to be high in  internal validity  if the way it was conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable. Thus experiments are high in internal validity because the way they are conducted—with the manipulation of the independent variable and the control of extraneous variables—provides strong support for causal conclusions.

External Validity

At the same time, the way that experiments are conducted sometimes leads to a different kind of criticism. Specifically, the need to manipulate the independent variable and control extraneous variables means that experiments are often conducted under conditions that seem artificial (Bauman, McGraw, Bartels, & Warren, 2014) [3] . In many psychology experiments, the participants are all undergraduate students and come to a classroom or laboratory to fill out a series of paper-and-pencil questionnaires or to perform a carefully designed computerized task. Consider, for example, an experiment in which researcher Barbara Fredrickson and her colleagues had undergraduate students come to a laboratory on campus and complete a math test while wearing a swimsuit (Fredrickson, Roberts, Noll, Quinn, & Twenge, 1998) [4] . At first, this manipulation might seem silly. When will undergraduate students ever have to complete math tests in their swimsuits outside of this experiment?

The issue we are confronting is that of external validity . An empirical study is high in external validity if the way it was conducted supports generalizing the results to people and situations beyond those actually studied. As a general rule, studies are higher in external validity when the participants and the situation studied are similar to those that the researchers want to generalize to and participants encounter everyday, often described as mundane realism . Imagine, for example, that a group of researchers is interested in how shoppers in large grocery stores are affected by whether breakfast cereal is packaged in yellow or purple boxes. Their study would be high in external validity and have high mundane realism if they studied the decisions of ordinary people doing their weekly shopping in a real grocery store. If the shoppers bought much more cereal in purple boxes, the researchers would be fairly confident that this increase would be true for other shoppers in other stores. Their study would be relatively low in external validity, however, if they studied a sample of undergraduate students in a laboratory at a selective university who merely judged the appeal of various colours presented on a computer screen; however, this study would have high psychological realism where the same mental process is used in both the laboratory and in the real world.  If the students judged purple to be more appealing than yellow, the researchers would not be very confident that this preference is relevant to grocery shoppers’ cereal-buying decisions because of low external validity but they could be confident that the visual processing of colours has high psychological realism.

We should be careful, however, not to draw the blanket conclusion that experiments are low in external validity. One reason is that experiments need not seem artificial. Consider that Darley and Latané’s experiment provided a reasonably good simulation of a real emergency situation. Or consider field experiments  that are conducted entirely outside the laboratory. In one such experiment, Robert Cialdini and his colleagues studied whether hotel guests choose to reuse their towels for a second day as opposed to having them washed as a way of conserving water and energy (Cialdini, 2005) [5] . These researchers manipulated the message on a card left in a large sample of hotel rooms. One version of the message emphasized showing respect for the environment, another emphasized that the hotel would donate a portion of their savings to an environmental cause, and a third emphasized that most hotel guests choose to reuse their towels. The result was that guests who received the message that most hotel guests choose to reuse their towels reused their own towels substantially more often than guests receiving either of the other two messages. Given the way they conducted their study, it seems very likely that their result would hold true for other guests in other hotels.

A second reason not to draw the blanket conclusion that experiments are low in external validity is that they are often conducted to learn about psychological processes  that are likely to operate in a variety of people and situations. Let us return to the experiment by Fredrickson and colleagues. They found that the women in their study, but not the men, performed worse on the math test when they were wearing swimsuits. They argued that this gender difference was due to women’s greater tendency to objectify themselves—to think about themselves from the perspective of an outside observer—which diverts their attention away from other tasks. They argued, furthermore, that this process of self-objectification and its effect on attention is likely to operate in a variety of women and situations—even if none of them ever finds herself taking a math test in her swimsuit.

Construct Validity

In addition to the generalizability of the results of an experiment, another element to scrutinize in a study is the quality of the experiment’s manipulations, or the construct validity. The research question that Darley and Latané started with is “does helping behaviour become diffused?” They hypothesized that participants in a lab would be less likely to help when they believed there were more potential helpers besides themselves. This conversion from research question to experiment design is called operationalization (see Chapter 2 for more information about the operational definition). Darley and Latané operationalized the independent variable of diffusion of responsibility by increasing the number of potential helpers. In evaluating this design, we would say that the construct validity was very high because the experiment’s manipulations very clearly speak to the research question: there was a crisis, there was a way for the participant to help, and, by increasing the number of other students involved in the discussion, the researchers provided a way to test diffusion.

What if the number of conditions in Darley and Latané’s study changed? Consider if there were only two conditions: one student involved in the discussion or two. Even though we may see a decrease in helping by adding another person, it may not be a clear demonstration of diffusion of responsibility, just merely the presence of others. We might think it was a form of Bandura’s social inhibition (discussed in Chapter 4). The construct validity would be lower. However, had there been five conditions, perhaps we would see the decrease continue with more people in the discussion, or perhaps it would plateau after a certain number of people. In that situation, we may not necessarily be learning more about diffusion of responsibility, or it may become a different phenomenon. By adding more conditions, the construct validity may not get higher. When designing your own experiment, consider how well the research question is operationalized in your study.

Statistical Validity

A common critique of experiments is that a study did not have enough participants. The main reason for this criticism is that it is difficult to generalize about a population from a small sample. At the outset, it seems as though this critique is about external validity but there are studies where small sample sizes are not a problem ( Chapter 10 will discuss how small samples, even of only 1 person, are still very illuminating for psychology research). Therefore, small sample sizes are actually a critique of statistical validity . The statistical validity speaks to whether the statistics conducted in the study support the conclusions that are made.

Proper statistical analysis should be conducted on the data to determine whether the difference or relationship that was predicted was found. The number of conditions and the number of total participants will determine the overall size of the effect. With this information, a power analysis can be conducted to ascertain whether you are likely to find a real difference. When designing a study, it is best to think about the power analysis so that the appropriate number of participants can be recruited and tested (more on effect sizes in Chapter 12 ). To design a statistically valid experiment, thinking about the statistical tests at the beginning of the design will help ensure the results can be believed.
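As a rough illustration of planning for statistical validity, the sketch below shows an a priori power analysis, assuming the Python package statsmodels is installed; the effect size, alpha, and power values are arbitrary placeholders rather than recommendations.

```python
# Minimal sketch of an a priori power analysis for a two-group (between-subjects) design.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the per-group sample size needed to detect a medium effect (d = 0.5)
# with alpha = .05 and 80% power, using an independent-samples t test.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Roughly {n_per_group:.0f} participants are needed in each condition.")
```

Running this kind of calculation before data collection is what allows the eventual statistical conclusions to be believed.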

Prioritizing Validities

These four big validities–internal, external, construct, and statistical–are useful to keep in mind when both reading about other experiments and designing your own. However, researchers must prioritize and often it is not possible to have high validity in all four areas. In Cialdini’s study on towel usage in hotels, the external validity was high but the statistical validity was more modest. This discrepancy does not invalidate the study but it shows where there may be room for improvement for future follow-up studies (Goldstein, Cialdini, & Griskevicius, 2008) [6] . Morling (2014) points out that most psychology studies have high internal and construct validity but sometimes sacrifice external validity.

Manipulation of the Independent Variable

Again, to  manipulate  an independent variable means to change its level systematically so that different groups of participants are exposed to different levels of that variable, or the same group of participants is exposed to different levels at different times. For example, to see whether expressive writing affects people’s health, a researcher might instruct some participants to write about traumatic experiences and others to write about neutral experiences. As discussed earlier in this chapter, the different levels of the independent variable are referred to as  conditions , and researchers often give the conditions short descriptive names to make it easy to talk and write about them. In this case, the conditions might be called the “traumatic condition” and the “neutral condition.”

Notice that the manipulation of an independent variable must involve the active intervention of the researcher. Comparing groups of people who differ on the independent variable before the study begins is not the same as manipulating that variable. For example, a researcher who compares the health of people who already keep a journal with the health of people who do not keep a journal has not manipulated this variable and therefore not conducted an experiment. This distinction  is important because groups that already differ in one way at the beginning of a study are likely to differ in other ways too. For example, people who choose to keep journals might also be more conscientious, more introverted, or less stressed than people who do not. Therefore, any observed difference between the two groups in terms of their health might have been caused by whether or not they keep a journal, or it might have been caused by any of the other differences between people who do and do not keep journals. Thus the active manipulation of the independent variable is crucial for eliminating the third-variable problem.

Of course, there are many situations in which the independent variable cannot be manipulated for practical or ethical reasons and therefore an experiment is not possible. For example, whether or not people have a significant early illness experience cannot be manipulated, making it impossible to conduct an experiment on the effect of early illness experiences on the development of hypochondriasis. This caveat does not mean it is impossible to study the relationship between early illness experiences and hypochondriasis—only that it must be done using nonexperimental approaches. We will discuss this type of methodology in detail later in the book.

In many experiments, the independent variable is a construct that can only be manipulated indirectly. For example, a researcher might try to manipulate participants’ stress levels indirectly by telling some of them that they have five minutes to prepare a short speech that they will then have to give to an audience of other participants. In such situations, researchers often include a manipulation check  in their procedure. A manipulation check is a separate measure of the construct the researcher is trying to manipulate. For example, researchers trying to manipulate participants’ stress levels might give them a paper-and-pencil stress questionnaire or take their blood pressure—perhaps right after the manipulation or at the end of the procedure—to verify that they successfully manipulated this variable.
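A manipulation check is usually analyzed like any other between-condition comparison. The sketch below is a hedged illustration, assuming the scipy package is available; the self-reported stress ratings and group labels are invented, not data from any study mentioned here.

```python
from scipy import stats

# Invented manipulation-check data: self-reported stress (1-10 scale) collected
# right after the manipulation in each condition.
stress_speech_group  = [7, 8, 6, 9, 7, 8, 7, 6]   # told to prepare a speech
stress_control_group = [3, 4, 2, 5, 3, 4, 3, 4]   # no speech instruction

result = stats.ttest_ind(stress_speech_group, stress_control_group)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# A reliably higher mean in the speech condition is evidence that the
# stress manipulation worked as intended.
```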

Control of Extraneous Variables

As we have seen previously in the chapter, an  extraneous variable  is anything that varies in the context of a study other than the independent and dependent variables. In an experiment on the effect of expressive writing on health, for example, extraneous variables would include participant variables (individual differences) such as their writing ability, their diet, and their shoe size. They would also include situational or task variables such as the time of day when participants write, whether they write by hand or on a computer, and the weather. Extraneous variables pose a problem because many of them are likely to have some effect on the dependent variable. For example, participants’ health will be affected by many things other than whether or not they engage in expressive writing. This influencing factor can make it difficult to separate the effect of the independent variable from the effects of the extraneous variables, which is why it is important to  control  extraneous variables by holding them constant.

Extraneous Variables as “Noise”

Extraneous variables make it difficult to detect the effect of the independent variable in two ways. One is by adding variability or “noise” to the data. Imagine a simple experiment on the effect of mood (happy vs. sad) on the number of happy childhood events people are able to recall. Participants are put into a negative or positive mood (by showing them a happy or sad video clip) and then asked to recall as many happy childhood events as they can. The two leftmost columns of  Table 6.1 show what the data might look like if there were no extraneous variables and the number of happy childhood events participants recalled was affected only by their moods. Every participant in the happy mood condition recalled exactly four happy childhood events, and every participant in the sad mood condition recalled exactly three. The effect of mood here is quite obvious. In reality, however, the data would probably look more like those in the two rightmost columns of  Table 6.1 . Even in the happy mood condition, some participants would recall fewer happy memories because they have fewer to draw on, use less effective recall strategies, or are less motivated. And even in the sad mood condition, some participants would recall more happy childhood memories because they have more happy memories to draw on, they use more effective recall strategies, or they are more motivated. Although the mean difference between the two groups is the same as in the idealized data, this difference is much less obvious in the context of the greater variability in the data. Thus one reason researchers try to control extraneous variables is so their data look more like the idealized data in  Table 6.1 , which makes the effect of the independent variable easier to detect (although real data never look quite  that  good).

Table 6.1 Number of happy childhood events recalled: idealized data with no extraneous variables (two left columns) and more realistic data with participant variability (two right columns)

Happy mood (idealized) | Sad mood (idealized) | Happy mood (realistic) | Sad mood (realistic)
4 | 3 | 3 | 1
4 | 3 | 6 | 3
4 | 3 | 2 | 4
4 | 3 | 4 | 0
4 | 3 | 5 | 5
4 | 3 | 2 | 7
4 | 3 | 3 | 2
4 | 3 | 1 | 5
4 | 3 | 6 | 1
4 | 3 | 8 | 2
M = 4 | M = 3 | M = 4 | M = 3
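The contrast in Table 6.1 can also be simulated. The sketch below is a minimal illustration: it builds an idealized data set in which mood is the only influence on recall, then a noisier data set with the same underlying one-point difference; the noise level is an arbitrary, made-up value.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the illustration is reproducible

# Idealized data: mood is the only influence on recall.
happy_ideal = [4] * 10
sad_ideal   = [3] * 10

# More realistic data: the same 1-point mood effect plus participant-level "noise"
# standing in for differences in memory, recall strategy, and motivation.
happy_noisy = [max(0, round(random.gauss(4, 2))) for _ in range(10)]
sad_noisy   = [max(0, round(random.gauss(3, 2))) for _ in range(10)]

print("Idealized means:", mean(happy_ideal), mean(sad_ideal))
print("Noisy means:    ", round(mean(happy_noisy), 1), round(mean(sad_noisy), 1))
# The underlying effect is the same size on average, but any difference in the noisy
# data is much harder to see against the participant-to-participant variability.
```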

One way to control extraneous variables is to hold them constant. This technique can mean holding situation or task variables constant by testing all participants in the same location, giving them identical instructions, treating them in the same way, and so on. It can also mean holding participant variables constant. For example, many studies of language limit participants to right-handed people, who generally have their language areas isolated in their left cerebral hemispheres. Left-handed people are more likely to have their language areas isolated in their right cerebral hemispheres or distributed across both hemispheres, which can change the way they process language and thereby add noise to the data.

In principle, researchers can control extraneous variables by limiting participants to one very specific category of person, such as 20-year-old, heterosexual, female, right-handed psychology majors. The obvious downside to this approach is that it would lower the external validity of the study—in particular, the extent to which the results can be generalized beyond the people actually studied. For example, it might be unclear whether results obtained with a sample of younger heterosexual women would apply to older homosexual men. In many situations, the advantages of a diverse sample outweigh the reduction in noise achieved by a homogeneous one.

Extraneous Variables as Confounding Variables

The second way that extraneous variables can make it difficult to detect the effect of the independent variable is by becoming confounding variables. A confounding variable  is an extraneous variable that differs on average across  levels of the independent variable. For example, in almost all experiments, participants’ intelligence quotients (IQs) will be an extraneous variable. But as long as there are participants with lower and higher IQs at each level of the independent variable so that the average IQ is roughly equal, then this variation is probably acceptable (and may even be desirable). What would be bad, however, would be for participants at one level of the independent variable to have substantially lower IQs on average and participants at another level to have substantially higher IQs on average. In this case, IQ would be a confounding variable.

To confound means to confuse , and this effect is exactly why confounding variables are undesirable. Because they differ across conditions—just like the independent variable—they provide an alternative explanation for any observed difference in the dependent variable.  Figure 6.1  shows the results of a hypothetical study, in which participants in a positive mood condition scored higher on a memory task than participants in a negative mood condition. But if IQ is a confounding variable—with participants in the positive mood condition having higher IQs on average than participants in the negative mood condition—then it is unclear whether it was the positive moods or the higher IQs that caused participants in the first condition to score higher. One way to avoid confounding variables is by holding extraneous variables constant. For example, one could prevent IQ from becoming a confounding variable by limiting participants only to those with IQs of exactly 100. But this approach is not always desirable for reasons we have already discussed. A second and much more general approach—random assignment to conditions—will be discussed in detail shortly.

Figure 6.1 Hypothetical Results From a Study on the Effect of Mood on Memory. Participants in the positive mood condition (who also had higher IQs) scored about 14 on the memory task, while participants in the negative mood condition (who had lower IQs) scored about 9. Because IQ also differs across conditions, it is a confounding variable.
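To see why a confound undermines the causal conclusion, here is a small hypothetical simulation in which memory performance depends only on IQ, yet the mood groups still differ because IQ differs across conditions; every number and the IQ-to-memory rule are invented for illustration.

```python
import random
from statistics import mean

random.seed(2)  # invented, reproducible illustration

def memory_score(iq):
    # Hypothetical rule: about one extra item recalled per 10 IQ points, plus noise.
    return round(iq / 10 + random.gauss(0, 1))

# Confounded design: positive-mood participants happen to have higher IQs.
positive_mood_iqs = [115, 120, 118, 125, 122]
negative_mood_iqs = [95, 90, 100, 92, 98]

positive_scores = [memory_score(iq) for iq in positive_mood_iqs]
negative_scores = [memory_score(iq) for iq in negative_mood_iqs]

print("Positive mood mean:", mean(positive_scores))
print("Negative mood mean:", mean(negative_scores))
# Mood has no effect at all in this simulation, yet the groups differ because IQ,
# the confounding variable, provides a complete alternative explanation.
```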

Key Takeaways

  • An experiment is a type of empirical study that features the manipulation of an independent variable, the measurement of a dependent variable, and control of extraneous variables.
  • Studies are high in internal validity to the extent that the way they are conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable. Experiments are generally high in internal validity because of the manipulation of the independent variable and control of extraneous variables.
  • Studies are high in external validity to the extent that the result can be generalized to people and situations beyond those actually studied. Although experiments can seem “artificial”—and low in external validity—it is important to consider whether the psychological processes under study are likely to operate in other people and situations.
  • Practice: List five variables that can be manipulated by the researcher in an experiment. List five variables that cannot be manipulated by the researcher in an experiment.
  • Practice: For each of the following topics, decide whether it could be studied using an experimental research design, and explain why or why not.
  • Effect of parietal lobe damage on people’s ability to do basic arithmetic.
  • Effect of being clinically depressed on the number of close friendships people have.
  • Effect of group training on the social skills of teenagers with Asperger’s syndrome.
  • Effect of paying people to take an IQ test on their performance on that test.
  • Judd, C.M. & Kenny, D.A. (1981). Estimating the effects of social interventions . Cambridge, MA: Cambridge University Press. ↵
  • Morling, B. (2014, April). Teach your students to be better consumers. APS Observer . Retrieved from http://www.psychologicalscience.org/index.php/publications/observer/2014/april-14/teach-your-students-to-be-better-consumers.html ↵
  • Bauman, C.W., McGraw, A.P., Bartels, D.M., & Warren, C. (2014). Revisiting external validity: Concerns about trolley problems and other sacrificial dilemmas in moral psychology. Social and Personality Psychology Compass, 8/9 , 536-554. ↵
  • Fredrickson, B. L., Roberts, T.-A., Noll, S. M., Quinn, D. M., & Twenge, J. M. (1998). The swimsuit becomes you: Sex differences in self-objectification, restrained eating, and math performance. Journal of Personality and Social Psychology, 75 , 269–284. ↵
  • Cialdini, R. (2005, April). Don’t throw in the towel: Use social influence research. APS Observer . Retrieved from http://www.psychologicalscience.org/index.php/publications/observer/2005/april-05/dont-throw-in-the-towel-use-social-influence-research.html ↵
  • Goldstein, N. J., Cialdini, R. B., & Griskevicius, V. (2008). A room with a viewpoint: Using social norms to motivate environmental conservation in hotels. Journal of Consumer Research, 35 , 472–482. ↵
  • Research Methods in Psychology. Authored by : Paul C. Price, Rajiv S. Jhangiani, and I-Chant A. Chiang. Provided by : BCCampus. Located at : https://opentextbc.ca/researchmethods/ . License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike


What Is Experimental Psychology?


The science of psychology spans several fields. There are dozens of disciplines in psychology, including abnormal psychology, cognitive psychology and social psychology.

One way to view these fields is to separate them into two types: applied vs. experimental psychology. These groups describe virtually any type of work in psychology.

The following sections explore what experimental psychology is and some examples of what it covers.

Experimental psychology seeks to explore and better understand behavior through empirical research methods. This work allows findings to be employed in real-world applications (applied psychology) across fields such as clinical psychology, educational psychology, forensic psychology, sports psychology, and social psychology. Experimental psychology is able to shed light on people’s personalities and life experiences by examining the way people behave and how behavior is shaped throughout life, along with other theoretical questions. The field looks at a wide range of behavioral topics including sensation, perception, attention, memory, cognition, and emotion, according to the American Psychological Association (APA).

Research is the focus of experimental psychology. Using scientific methods to collect data and perform research, experimental psychology focuses on certain questions, and, one study at a time, reveals information that contributes to larger findings or a conclusion. Due to the breadth and depth of certain areas of study, researchers can spend their entire careers looking at a complex research question.

Experimental Psychology in Action

The APA writes about one experimental psychologist, Robert McCann, who is now retired after 19 years working at NASA. During his time at NASA, his work focused on the user experience — on land and in space — where he applied his expertise to cockpit system displays, navigation systems, and safety displays used by astronauts in NASA spacecraft. McCann’s knowledge of human information processing allowed him to help NASA design shuttle displays that can increase the safety of shuttle missions. He looked at human limitations of attention and display processing to gauge what people can reliably see and correctly interpret on an instrument panel. McCann played a key role in helping to determine the features of cockpit displays without overloading the pilot or taxing their attention span.

“One of the purposes of the display was to alert the astronauts to the presence of a failure that interrupted power in a specific region,” McCann said. “The most obvious way to depict this interruption was to simply remove (or dim) the white line(s) connecting the affected components. Basic research on visual attention has shown that humans do not notice the removal of a display feature very easily when the display is highly cluttered. We are much better at noticing a feature or object that is suddenly added to a display.” McCann utilized his knowledge of experimental psychology to research and develop this important display feature for NASA.

Valve Corporation

Another experimental psychologist, Mike Ambinder, uses his expertise to help design video games. He is a senior experimental psychologist at Valve Corporation, a video game developer and developer of the software distribution platform Steam. Ambinder told  Orlando Weekly  that his career working on gaming hits such as Portal 2 and Left 4 Dead “epitomizes the intersection between scientific innovation and electronic entertainment.” His career started when he gave a presentation to Valve on applying psychology to game design; this occurred while he was finishing his PhD in experimental design. “I’m very lucky to have landed at a company where freedom and autonomy and analytical decision-making are prized,” he said. “I realized how fortunate I was to work for a company that would encourage someone with a background in psychology to see what they could contribute in a field where they had no prior experience.” 

Ambinder spends his time on data analysis, hardware research, play-testing methodologies, and on any aspect of games where knowledge of human behavior could be useful. Ambinder described Valve’s process for refining a product as straightforward. “We come up with a game design (our hypothesis), and we place it in front of people external to the company (our play-test or experiment). We gather their feedback, and then iterate and improve the design (refining the theory). It’s essentially the scientific method applied to game design, and the end result is the consequence of many hours of applying this process.” To gather play-test data, Ambinder is engaged in the newer field of biofeedback technology, which can quantify gamers’ enjoyment. His research looks at unobtrusive measurements of facial expressions that can achieve such goals. Ambinder is also examining eye-tracking as a next-generation input method.


6.2 Experimental Design

Learning Objectives

  • Explain the difference between between-subjects and within-subjects experiments, list some of the pros and cons of each approach, and decide which approach to use to answer a particular research question.
  • Define random assignment, distinguish it from random sampling, explain its purpose in experimental research, and use some simple strategies to implement it.
  • Define what a control condition is, explain its purpose in research on treatment effectiveness, and describe some alternative types of control conditions.
  • Define several types of carryover effect, give examples of each, and explain how counterbalancing helps to deal with them.

In this section, we look at some different ways to design an experiment. The primary distinction we will make is between approaches in which each participant experiences one level of the independent variable and approaches in which each participant experiences all levels of the independent variable. The former are called between-subjects experiments and the latter are called within-subjects experiments.

Between-Subjects Experiments

In a between-subjects experiment, each participant is tested in only one condition. For example, a researcher with a sample of 100 college students might assign half of them to write about a traumatic event and the other half to write about a neutral event. Or a researcher with a sample of 60 people with severe agoraphobia (fear of open spaces) might assign 20 of them to receive each of three different treatments for that disorder. It is essential in a between-subjects experiment that the researcher assign participants to conditions so that the different groups are, on average, highly similar to each other. Those in a trauma condition and a neutral condition, for example, should include a similar proportion of men and women, and they should have similar average intelligence quotients (IQs), similar average levels of motivation, similar average numbers of health problems, and so on. This is a matter of controlling these extraneous participant variables across conditions so that they do not become confounding variables.

Random Assignment

The primary way that researchers accomplish this kind of control of extraneous variables across conditions is called random assignment , which means using a random process to decide which participants are tested in which conditions. Do not confuse random assignment with random sampling. Random sampling is a method for selecting a sample from a population, and it is rarely used in psychological research. Random assignment is a method for assigning participants in a sample to the different conditions, and it is an important element of all experimental research in psychology and other fields too.

In its strictest sense, random assignment should meet two criteria. One is that each participant has an equal chance of being assigned to each condition (e.g., a 50% chance of being assigned to each of two conditions). The second is that each participant is assigned to a condition independently of other participants. Thus one way to assign participants to two conditions would be to flip a coin for each one. If the coin lands heads, the participant is assigned to Condition A, and if it lands tails, the participant is assigned to Condition B. For three conditions, one could use a computer to generate a random integer from 1 to 3 for each participant. If the integer is 1, the participant is assigned to Condition A; if it is 2, the participant is assigned to Condition B; and if it is 3, the participant is assigned to Condition C. In practice, a full sequence of conditions—one for each participant expected to be in the experiment—is usually created ahead of time, and each new participant is assigned to the next condition in the sequence as he or she is tested. When the procedure is computerized, the computer program often handles the random assignment.
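To make the strict procedure concrete, here is a minimal sketch in Python. The condition labels and the number of participants are invented for illustration; the key point is that every participant gets an independent random draw with equal probabilities, exactly as with a coin flip or a random integer.

    # Strict random assignment: each participant is assigned independently,
    # with an equal chance of each condition. Labels and sample size are
    # arbitrary examples.
    import random

    conditions = ["A", "B", "C"]
    n_participants = 9

    assignments = [random.choice(conditions) for _ in range(n_participants)]
    print(assignments)   # e.g., ['B', 'A', 'A', 'C', 'B', 'B', 'A', 'C', 'B']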

One problem with coin flipping and other strict procedures for random assignment is that they are likely to result in unequal sample sizes in the different conditions. Unequal sample sizes are generally not a serious problem, and you should never throw away data you have already collected to achieve equal sample sizes. However, for a fixed number of participants, it is statistically most efficient to divide them into equal-sized groups. It is standard practice, therefore, to use a kind of modified random assignment that keeps the number of participants in each group as similar as possible. One approach is block randomization . In block randomization, all the conditions occur once in the sequence before any of them is repeated. Then they all occur again before any of them is repeated again. Within each of these “blocks,” the conditions occur in a random order. Again, the sequence of conditions is usually generated before any participants are tested, and each new participant is assigned to the next condition in the sequence. Table 6.2 “Block Randomization Sequence for Assigning Nine Participants to Three Conditions” shows such a sequence for assigning nine participants to three conditions. The Research Randomizer website ( http://www.randomizer.org ) will generate block randomization sequences for any number of participants and conditions. Again, when the procedure is computerized, the computer program often handles the block randomization.

Table 6.2 Block Randomization Sequence for Assigning Nine Participants to Three Conditions (excerpt: participants 4–6 shown)

Participant   Condition
4             B
5             C
6             A
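The block-randomization procedure itself is easy to sketch in Python. The version below is only an illustration (the condition labels and the sample size of nine are arbitrary), but it follows the rule described above: each condition appears once, in a random order, within every block before any condition repeats.

    # Block randomization: shuffle the full set of conditions within each
    # block, then concatenate blocks until every participant has a condition.
    import random

    def block_randomize(conditions, n_participants):
        sequence = []
        while len(sequence) < n_participants:
            block = list(conditions)
            random.shuffle(block)        # random order within this block
            sequence.extend(block)
        return sequence[:n_participants]

    # Nine participants and three conditions, as in Table 6.2.
    print(block_randomize(["A", "B", "C"], 9))
    # e.g., ['B', 'C', 'A', 'A', 'C', 'B', 'C', 'A', 'B']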

Random assignment is not guaranteed to control all extraneous variables across conditions. It is always possible that just by chance, the participants in one condition might turn out to be substantially older, less tired, more motivated, or less depressed on average than the participants in another condition. However, there are some reasons that this is not a major concern. One is that random assignment works better than one might expect, especially for large samples. Another is that the inferential statistics that researchers use to decide whether a difference between groups reflects a difference in the population take the “fallibility” of random assignment into account. Yet another reason is that even if random assignment does result in a confounding variable and therefore produces misleading results, this is likely to be detected when the experiment is replicated. The upshot is that random assignment to conditions—although not infallible in terms of controlling extraneous variables—is always considered a strength of a research design.
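The claim that random assignment works better for large samples can be checked with a small simulation. The Python sketch below is purely illustrative: it repeatedly splits simulated participants into two groups at random and records how far apart the groups end up, on average, on a made-up extraneous variable (labelled “age” here, drawn from an arbitrary distribution).

    # How well does random assignment balance an extraneous variable?
    import numpy as np

    rng = np.random.default_rng(1)

    def mean_group_difference(n_per_group, n_simulations=5000):
        diffs = []
        for _ in range(n_simulations):
            ages = rng.normal(loc=20, scale=3, size=2 * n_per_group)  # simulated ages
            rng.shuffle(ages)                                         # random assignment
            diffs.append(abs(ages[:n_per_group].mean() - ages[n_per_group:].mean()))
        return np.mean(diffs)

    for n in (10, 50, 200):
        print(n, round(mean_group_difference(n), 2))
    # The average age gap between randomly formed groups shrinks as n grows.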

Treatment and Control Conditions

Between-subjects experiments are often used to determine whether a treatment works. In psychological research, a treatment is any intervention meant to change people’s behavior for the better. This includes psychotherapies and medical treatments for psychological disorders but also interventions designed to improve learning, promote conservation, reduce prejudice, and so on. To determine whether a treatment works, participants are randomly assigned to either a treatment condition , in which they receive the treatment, or a control condition , in which they do not receive the treatment. If participants in the treatment condition end up better off than participants in the control condition—for example, they are less depressed, learn faster, conserve more, express less prejudice—then the researcher can conclude that the treatment works. In research on the effectiveness of psychotherapies and medical treatments, this type of experiment is often called a randomized clinical trial .

There are different types of control conditions. In a no-treatment control condition , participants receive no treatment whatsoever. One problem with this approach, however, is the existence of placebo effects. A placebo is a simulated treatment that lacks any active ingredient or element that should make it effective, and a placebo effect is a positive effect of such a treatment. Many folk remedies that seem to work—such as eating chicken soup for a cold or placing soap under the bedsheets to stop nighttime leg cramps—are probably nothing more than placebos. Although placebo effects are not well understood, they are probably driven primarily by people’s expectations that they will improve. Having the expectation to improve can result in reduced stress, anxiety, and depression, which can alter perceptions and even improve immune system functioning (Price, Finniss, & Benedetti, 2008).

Placebo effects are interesting in their own right (see Note 6.28 “The Powerful Placebo” ), but they also pose a serious problem for researchers who want to determine whether a treatment works. Figure 6.2 “Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions” shows some hypothetical results in which participants in a treatment condition improved more on average than participants in a no-treatment control condition. If these conditions (the two leftmost bars in Figure 6.2 “Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions” ) were the only conditions in this experiment, however, one could not conclude that the treatment worked. It could be instead that participants in the treatment group improved more because they expected to improve, while those in the no-treatment control condition did not.

Figure 6.2 Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions


Fortunately, there are several solutions to this problem. One is to include a placebo control condition , in which participants receive a placebo that looks much like the treatment but lacks the active ingredient or element thought to be responsible for the treatment’s effectiveness. When participants in a treatment condition take a pill, for example, then those in a placebo control condition would take an identical-looking pill that lacks the active ingredient in the treatment (a “sugar pill”). In research on psychotherapy effectiveness, the placebo might involve going to a psychotherapist and talking in an unstructured way about one’s problems. The idea is that if participants in both the treatment and the placebo control groups expect to improve, then any improvement in the treatment group over and above that in the placebo control group must have been caused by the treatment and not by participants’ expectations. This is what is shown by a comparison of the two outer bars in Figure 6.2 “Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions” .

Of course, the principle of informed consent requires that participants be told that they will be assigned to either a treatment or a placebo control condition—even though they cannot be told which until the experiment ends. In many cases the participants who had been in the control condition are then offered an opportunity to have the real treatment. An alternative approach is to use a waitlist control condition , in which participants are told that they will receive the treatment but must wait until the participants in the treatment condition have already received it. This allows researchers to compare participants who have received the treatment with participants who are not currently receiving it but who still expect to improve (eventually). A final solution to the problem of placebo effects is to leave out the control condition completely and compare any new treatment with the best available alternative treatment. For example, a new treatment for simple phobia could be compared with standard exposure therapy. Because participants in both conditions receive a treatment, their expectations about improvement should be similar. This approach also makes sense because once there is an effective treatment, the interesting question about a new treatment is not simply “Does it work?” but “Does it work better than what is already available?”

The Powerful Placebo

Many people are not surprised that placebos can have a positive effect on disorders that seem fundamentally psychological, including depression, anxiety, and insomnia. However, placebos can also have a positive effect on disorders that most people think of as fundamentally physiological. These include asthma, ulcers, and warts (Shapiro & Shapiro, 1999). There is even evidence that placebo surgery—also called “sham surgery”—can be as effective as actual surgery.

Medical researcher J. Bruce Moseley and his colleagues conducted a study on the effectiveness of two arthroscopic surgery procedures for osteoarthritis of the knee (Moseley et al., 2002). The control participants in this study were prepped for surgery, received a tranquilizer, and even received three small incisions in their knees. But they did not receive the actual arthroscopic surgical procedure. The surprising result was that all participants improved in terms of both knee pain and function, and the sham surgery group improved just as much as the treatment groups. According to the researchers, “This study provides strong evidence that arthroscopic lavage with or without débridement [the surgical procedures used] is not better than and appears to be equivalent to a placebo procedure in improving knee pain and self-reported function” (p. 85).

Research has shown that patients with osteoarthritis of the knee who receive a “sham surgery” experience reductions in pain and improvement in knee function similar to those of patients who receive a real surgery.

Within-Subjects Experiments

In a within-subjects experiment , each participant is tested under all conditions. Consider an experiment on the effect of a defendant’s physical attractiveness on judgments of his guilt. Again, in a between-subjects experiment, one group of participants would be shown an attractive defendant and asked to judge his guilt, and another group of participants would be shown an unattractive defendant and asked to judge his guilt. In a within-subjects experiment, however, the same group of participants would judge the guilt of both an attractive and an unattractive defendant.

The primary advantage of this approach is that it provides maximum control of extraneous participant variables. Participants in all conditions have the same mean IQ, same socioeconomic status, same number of siblings, and so on—because they are the very same people. Within-subjects experiments also make it possible to use statistical procedures that remove the effect of these extraneous participant variables on the dependent variable and therefore make the data less “noisy” and the effect of the independent variable easier to detect. We will look more closely at this idea later in the book.

Carryover Effects and Counterbalancing

The primary disadvantage of within-subjects designs is that they can result in carryover effects. A carryover effect is an effect of being tested in one condition on participants’ behavior in later conditions. One type of carryover effect is a practice effect , where participants perform a task better in later conditions because they have had a chance to practice it. Another type is a fatigue effect , where participants perform a task worse in later conditions because they become tired or bored. Being tested in one condition can also change how participants perceive stimuli or interpret their task in later conditions. This is called a context effect . For example, an average-looking defendant might be judged more harshly when participants have just judged an attractive defendant than when they have just judged an unattractive defendant. Within-subjects experiments also make it easier for participants to guess the hypothesis. For example, a participant who is asked to judge the guilt of an attractive defendant and then is asked to judge the guilt of an unattractive defendant is likely to guess that the hypothesis is that defendant attractiveness affects judgments of guilt. This could lead the participant to judge the unattractive defendant more harshly because he thinks this is what he is expected to do. Or it could make participants judge the two defendants similarly in an effort to be “fair.”

Carryover effects can be interesting in their own right. (Does the attractiveness of one person depend on the attractiveness of other people that we have seen recently?) But when they are not the focus of the research, carryover effects can be problematic. Imagine, for example, that participants judge the guilt of an attractive defendant and then judge the guilt of an unattractive defendant. If they judge the unattractive defendant more harshly, this might be because of his unattractiveness. But it could be instead that they judge him more harshly because they are becoming bored or tired. In other words, the order of the conditions is a confounding variable. The attractive condition is always the first condition and the unattractive condition the second. Thus any difference between the conditions in terms of the dependent variable could be caused by the order of the conditions and not the independent variable itself.

There is a solution to the problem of order effects, however, that can be used in many situations. It is counterbalancing, which means testing different participants in different orders. For example, some participants would be tested in the attractive defendant condition followed by the unattractive defendant condition, and others would be tested in the unattractive condition followed by the attractive condition. With three conditions, there would be six different orders (ABC, ACB, BAC, BCA, CAB, and CBA), so some participants would be tested in each of the six orders. With counterbalancing, participants are assigned to orders randomly, using the techniques we have already discussed. Thus random assignment plays an important role in within-subjects designs just as in between-subjects designs. Here, instead of participants being randomly assigned to conditions, they are randomly assigned to different orders of conditions. In fact, it can safely be said that if a study does not involve random assignment in one form or another, it is not an experiment.
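As an illustration, complete counterbalancing with three conditions can be set up in a few lines of Python. This sketch assumes, for simplicity, that the number of participants is a multiple of the number of orders; the condition labels and sample size are arbitrary.

    # Complete counterbalancing: enumerate all orders of the conditions and
    # randomly assign participants to orders, with each order used equally often.
    import itertools
    import random

    conditions = ["A", "B", "C"]
    orders = list(itertools.permutations(conditions))      # ABC, ACB, BAC, BCA, CAB, CBA

    n_participants = 12                                    # must be a multiple of 6 here
    order_list = orders * (n_participants // len(orders))  # two participants per order
    random.shuffle(order_list)                             # random pairing of people to orders

    for i, order in enumerate(order_list, start=1):
        print(f"Participant {i}:", " then ".join(order))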

There are two ways to think about what counterbalancing accomplishes. One is that it controls the order of conditions so that it is no longer a confounding variable. Instead of the attractive condition always being first and the unattractive condition always being second, the attractive condition comes first for some participants and second for others. Likewise, the unattractive condition comes first for some participants and second for others. Thus any overall difference in the dependent variable between the two conditions cannot have been caused by the order of conditions. A second way to think about what counterbalancing accomplishes is that if there are carryover effects, it makes it possible to detect them. One can analyze the data separately for each order to see whether it had an effect.

When 9 Is “Larger” Than 221

Researcher Michael Birnbaum has argued that the lack of context provided by between-subjects designs is often a bigger problem than the context effects created by within-subjects designs. To demonstrate this, he asked one group of participants to rate how large the number 9 was on a 1-to-10 rating scale and another group to rate how large the number 221 was on the same 1-to-10 rating scale (Birnbaum, 1999). Participants in this between-subjects design gave the number 9 a mean rating of 5.13 and the number 221 a mean rating of 3.10. In other words, they rated 9 as larger than 221! According to Birnbaum, this is because participants spontaneously compared 9 with other one-digit numbers (in which case it is relatively large) and compared 221 with other three-digit numbers (in which case it is relatively small).

Simultaneous Within-Subjects Designs

So far, we have discussed an approach to within-subjects designs in which participants are tested in one condition at a time. There is another approach, however, that is often used when participants make multiple responses in each condition. Imagine, for example, that participants judge the guilt of 10 attractive defendants and 10 unattractive defendants. Instead of having people make judgments about all 10 defendants of one type followed by all 10 defendants of the other type, the researcher could present all 20 defendants in a sequence that mixed the two types. The researcher could then compute each participant’s mean rating for each type of defendant. Or imagine an experiment designed to see whether people with social anxiety disorder remember negative adjectives (e.g., “stupid,” “incompetent”) better than positive ones (e.g., “happy,” “productive”). The researcher could have participants study a single list that includes both kinds of words and then have them try to recall as many words as possible. The researcher could then count the number of each type of word that was recalled. There are many ways to determine the order in which the stimuli are presented, but one common way is to generate a different random order for each participant.
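The bookkeeping for this kind of mixed presentation is simple to sketch in Python. The stimuli and ratings below are placeholders invented for illustration (a real study would record participants’ actual judgments); the point is that each participant sees the 20 stimuli in their own random order, and a mean is then computed separately for each stimulus type.

    # Simultaneous within-subjects presentation: mix two stimulus types into
    # one random sequence per participant, then summarise by type.
    import random
    from statistics import mean

    stimuli = [("attractive", i) for i in range(10)] + [("unattractive", i) for i in range(10)]

    def run_participant():
        order = stimuli[:]                 # copy the 20 stimuli
        random.shuffle(order)              # a different random order for each participant
        ratings = [(kind, random.randint(1, 7)) for kind, _ in order]  # placeholder 1-7 ratings
        return {kind: mean(r for k, r in ratings if k == kind)
                for kind in ("attractive", "unattractive")}

    print(run_participant())   # e.g., {'attractive': 4.1, 'unattractive': 3.8}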

Between-Subjects or Within-Subjects?

Almost every experiment can be conducted using either a between-subjects design or a within-subjects design. This means that researchers must choose between the two approaches based on their relative merits for the particular situation.

Between-subjects experiments have the advantage of being conceptually simpler and requiring less testing time per participant. They also avoid carryover effects without the need for counterbalancing. Within-subjects experiments have the advantage of controlling extraneous participant variables, which generally reduces noise in the data and makes it easier to detect a relationship between the independent and dependent variables.

A good rule of thumb, then, is that if it is possible to conduct a within-subjects experiment (with proper counterbalancing) in the time that is available per participant—and you have no serious concerns about carryover effects—this is probably the best option. If a within-subjects design would be difficult or impossible to carry out, then you should consider a between-subjects design instead. For example, if you were testing participants in a doctor’s waiting room or shoppers in line at a grocery store, you might not have enough time to test each participant in all conditions and therefore would opt for a between-subjects design. Or imagine you were trying to reduce people’s level of prejudice by having them interact with someone of another race. A within-subjects design with counterbalancing would require testing some participants in the treatment condition first and then in a control condition. But if the treatment works and reduces people’s level of prejudice, then they would no longer be suitable for testing in the control condition. This is true for many designs that involve a treatment meant to produce long-term change in participants’ behavior (e.g., studies testing the effectiveness of psychotherapy). Clearly, a between-subjects design would be necessary here.

Remember also that using one type of design does not preclude using the other type in a different study. There is no reason that a researcher could not use both a between-subjects design and a within-subjects design to answer the same research question. In fact, professional researchers often do exactly this.

Key Takeaways

  • Experiments can be conducted using either between-subjects or within-subjects designs. Deciding which to use in a particular situation requires careful consideration of the pros and cons of each approach.
  • Random assignment to conditions in between-subjects experiments or to orders of conditions in within-subjects experiments is a fundamental element of experimental research. Its purpose is to control extraneous variables so that they do not become confounding variables.
  • Experimental research on the effectiveness of a treatment requires both a treatment condition and a control condition, which can be a no-treatment control condition, a placebo control condition, or a waitlist control condition. Experimental treatments can also be compared with the best available alternative.

Discussion: For each of the following topics, list the pros and cons of a between-subjects and within-subjects design and decide which would be better.

  • You want to test the relative effectiveness of two training programs for running a marathon.
  • Using photographs of people as stimuli, you want to see if smiling people are perceived as more intelligent than people who are not smiling.
  • In a field experiment, you want to see if the way a panhandler is dressed (neatly vs. sloppily) affects whether or not passersby give him any money.
  • You want to see if concrete nouns (e.g., dog ) are recalled better than abstract nouns (e.g., truth ).

Discussion: Imagine that an experiment shows that participants who receive psychodynamic therapy for a dog phobia improve more than participants in a no-treatment control group. Explain a fundamental problem with this research design and at least two ways that it might be corrected.

Birnbaum, M. H. (1999). How to show that 9 > 221: Collect judgments in a between-subjects design. Psychological Methods, 4 , 243–249.

Moseley, J. B., O’Malley, K., Petersen, N. J., Menke, T. J., Brody, B. A., Kuykendall, D. H., … Wray, N. P. (2002). A controlled trial of arthroscopic surgery for osteoarthritis of the knee. The New England Journal of Medicine, 347 , 81–88.

Price, D. D., Finniss, D. G., & Benedetti, F. (2008). A comprehensive review of the placebo effect: Recent advances and current thought. Annual Review of Psychology, 59 , 565–590.

Shapiro, A. K., & Shapiro, E. (1999). The powerful placebo: From ancient priest to modern physician . Baltimore, MD: Johns Hopkins University Press.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

In the late 1960s social psychologists John Darley and Bibb Latané proposed a counter-intuitive hypothesis. The more witnesses there are to an accident or a crime, the less likely any of them is to help the victim (Darley & Latané, 1968) [1] .

They also suggested that this phenomenon occurs because each witness feels less responsible for helping—a process referred to as the “diffusion of responsibility.” Darley and Latané noted that their ideas were consistent with many real-world cases. For example, a New York woman named Catherine “Kitty” Genovese was assaulted and murdered while several witnesses evidently failed to help. But Darley and Latané also understood that such isolated cases did not provide convincing evidence for their hypothesized “bystander effect.” There was no way to know, for example, whether any of the witnesses to Kitty Genovese’s murder would have helped had there been fewer of them.

So to test their hypothesis, Darley and Latané created a simulated emergency situation in a laboratory. Each of their university student participants was isolated in a small room and told that he or she would be having a discussion about university life with other students via an intercom system. Early in the discussion, however, one of the students began having what seemed to be an epileptic seizure. Over the intercom came the following: “I could really-er-use some help so if somebody would-er-give me a little h-help-uh-er-er-er-er-er c-could somebody-er-er-help-er-uh-uh-uh (choking sounds)…I’m gonna die-er-er-I’m…gonna die-er-help-er-er-seizure-er- [chokes, then quiet]” (Darley & Latané, 1968, p. 379) [2] .

In actuality, there were no other students. These comments had been prerecorded and were played back to create the appearance of a real emergency. The key to the study was that some participants were told that the discussion involved only one other student (the victim), others were told that it involved two other students, and still others were told that it included five other students. Because this was the only difference between these three groups of participants, any difference in their tendency to help the victim would have to have been caused by it. And sure enough, the likelihood that the participant left the room to seek help for the “victim” decreased from 85% to 62% to 31% as the number of “witnesses” increased.

The Parable of the 38 Witnesses

The story of Kitty Genovese has been told and retold in numerous psychology textbooks. The standard version is that there were 38 witnesses to the crime, that all of them watched (or listened) for an extended period of time, and that none of them did anything to help. However, recent scholarship suggests that the standard story is inaccurate in many ways (Manning, Levine, & Collins, 2007) [3] . For example, only six eyewitnesses testified at the trial, none of them was aware that he or she was witnessing a lethal assault, and there have been several reports of witnesses calling the police or even coming to the aid of Kitty Genovese. Although the standard story inspired a long line of research on the bystander effect and the diffusion of responsibility, it may also have directed researchers’ and students’ attention away from other equally interesting and important issues in the psychology of helping—including the conditions in which people do in fact respond collectively to emergency situations.

The research that Darley and Latané conducted was a particular kind of study called an experiment. Experiments are used to determine not only whether there is a meaningful relationship between two variables but also whether the relationship is a causal one that is supported by statistical analysis. For this reason, experiments are one of the most common and useful tools in the psychological researcher’s toolbox. In this chapter, we look at experiments in detail. We will first consider what sets experiments apart from other kinds of studies and why they support causal conclusions while other kinds of studies do not. We then look at two basic ways of designing an experiment—between-subjects designs and within-subjects designs—and discuss their pros and cons. Finally, we consider several important practical issues that arise when conducting experiments.

  • Darley, J. M., & Latané, B. (1968). Bystander intervention in emergencies: Diffusion of responsibility. Journal of Personality and Social Psychology, 4 , 377–383. ↵
  • Manning, R., Levine, M., & Collins, A. (2007). The Kitty Genovese murder and the social psychology of helping: The parable of the 38 witnesses. American Psychologist, 62 , 555–562. ↵


2.1 Why Is Research Important?

Learning objectives.

By the end of this section, you will be able to:

  • Explain how scientific research addresses questions about behavior
  • Discuss how scientific research guides public policy
  • Appreciate how scientific research can be important in making personal decisions

Scientific research is a critical tool for successfully navigating our complex world. Without it, we would be forced to rely solely on intuition, other people’s authority, and blind luck. While many of us feel confident in our abilities to decipher and interact with the world around us, history is filled with examples of how very wrong we can be when we fail to recognize the need for evidence in supporting claims. At various times in history, we would have been certain that the sun revolved around a flat earth, that the earth’s continents did not move, and that mental illness was caused by possession ( Figure 2.2 ). It is through systematic scientific research that we divest ourselves of our preconceived notions and superstitions and gain an objective understanding of ourselves and our world.

The goal of all scientists is to better understand the world around them. Psychologists focus their attention on understanding behavior, as well as the cognitive (mental) and physiological (body) processes that underlie behavior. In contrast to other methods that people use to understand the behavior of others, such as intuition and personal experience, the hallmark of scientific research is that there is evidence to support a claim. Scientific knowledge is empirical : It is grounded in objective, tangible evidence that can be observed time and time again, regardless of who is observing.

While behavior is observable, the mind is not. If someone is crying, we can see behavior. However, the reason for the behavior is more difficult to determine. Is the person crying due to being sad, in pain, or happy? Sometimes we can learn the reason for someone’s behavior by simply asking a question, like “Why are you crying?” However, there are situations in which an individual is either uncomfortable or unwilling to answer the question honestly, or is incapable of answering. For example, infants would not be able to explain why they are crying. In such circumstances, the psychologist must be creative in finding ways to better understand behavior. This chapter explores how scientific knowledge is generated, and how important that knowledge is in forming decisions in our personal lives and in the public domain.

Use of Research Information

Trying to determine which theories are and are not accepted by the scientific community can be difficult, especially in an area of research as broad as psychology. More than ever before, we have an incredible amount of information at our fingertips, and a simple internet search on any given research topic might result in a number of contradictory studies. In these cases, we are witnessing the scientific community going through the process of reaching a consensus, and it could be quite some time before a consensus emerges. For example, the explosion in our use of technology has led researchers to question whether this ultimately helps or hinders us. The use and implementation of technology in educational settings has become widespread over the last few decades. Researchers are coming to different conclusions regarding the use of technology. To illustrate this point, a study investigating a smartphone app targeting surgery residents (graduate students in surgery training) found that the use of this app can increase student engagement and raise test scores (Shaw & Tan, 2015). Conversely, another study found that the use of technology in undergraduate student populations had negative impacts on sleep, communication, and time management skills (Massimini & Peterson, 2009). Until sufficient amounts of research have been conducted, there will be no clear consensus on the effects that technology has on a student's acquisition of knowledge, study skills, and mental health.

In the meantime, we should strive to think critically about the information we encounter by exercising a degree of healthy skepticism. When someone makes a claim, we should examine the claim from a number of different perspectives: what is the expertise of the person making the claim, what might they gain if the claim is valid, does the claim seem justified given the evidence, and what do other researchers think of the claim? This is especially important when we consider how much information in advertising campaigns and on the internet claims to be based on “scientific evidence” when in actuality it is a belief or perspective of just a few individuals trying to sell a product or draw attention to their perspectives.

We should be informed consumers of the information made available to us because decisions based on this information have significant consequences. One such consequence can be seen in politics and public policy. Imagine that you have been elected as the governor of your state. One of your responsibilities is to manage the state budget and determine how to best spend your constituents’ tax dollars. As the new governor, you need to decide whether to continue funding early intervention programs. These programs are designed to help children who come from low-income backgrounds, have special needs, or face other disadvantages. These programs may involve providing a wide variety of services to maximize the children's development and position them for optimal levels of success in school and later in life (Blann, 2005). While such programs sound appealing, you would want to be sure that they also proved effective before investing additional money in these programs. Fortunately, psychologists and other scientists have conducted vast amounts of research on such programs and, in general, the programs are found to be effective (Neil & Christensen, 2009; Peters-Scheffer, Didden, Korzilius, & Sturmey, 2011). While not all programs are equally effective, and the short-term effects of many such programs are more pronounced, there is reason to believe that many of these programs produce long-term benefits for participants (Barnett, 2011). If you are committed to being a good steward of taxpayer money, you would want to look at research. Which programs are most effective? What characteristics of these programs make them effective? Which programs promote the best outcomes? After examining the research, you would be best equipped to make decisions about which programs to fund.


Ultimately, it is not just politicians who can benefit from using research in guiding their decisions. We all might look to research from time to time when making decisions in our lives. Imagine that your sister, Maria, expresses concern about her two-year-old child, Umberto. Umberto does not speak as much or as clearly as the other children in his daycare or others in the family. Umberto's pediatrician undertakes some screening and recommends an evaluation by a speech pathologist, but does not refer Maria to any other specialists. Maria is concerned that Umberto's speech delays are signs of a developmental disorder, but Umberto's pediatrician does not; she sees indications of differences in Umberto's jaw and facial muscles. Hearing this, you do some internet searches, but you are overwhelmed by the breadth of information and the wide array of sources. You see blog posts, top-ten lists, advertisements from healthcare providers, and recommendations from several advocacy organizations. Why are there so many sites? Which are based in research, and which are not?

In the end, research is what makes the difference between facts and opinions. Facts are observable realities, and opinions are personal judgments, conclusions, or attitudes that may or may not be accurate. In the scientific community, facts can be established only using evidence collected through empirical research.

NOTABLE RESEARCHERS

Psychological research has a long history involving important figures from diverse backgrounds. While the introductory chapter discussed several researchers who made significant contributions to the discipline, there are many more individuals who deserve attention in considering how psychology has advanced as a science through their work ( Figure 2.3 ). For instance, Margaret Floy Washburn (1871–1939) was the first woman to earn a PhD in psychology. Her research focused on animal behavior and cognition (Margaret Floy Washburn, PhD, n.d.). Mary Whiton Calkins (1863–1930) was a preeminent first-generation American psychologist who opposed the behaviorist movement, conducted significant research into memory, and established one of the earliest experimental psychology labs in the United States (Mary Whiton Calkins, n.d.).

Francis Sumner (1895–1954) was the first African American to receive a PhD in psychology in 1920. His dissertation focused on issues related to psychoanalysis. Sumner also had research interests in racial bias and educational justice. Sumner was one of the founders of Howard University’s department of psychology, and because of his accomplishments, he is sometimes referred to as the “Father of Black Psychology.” Thirteen years later, Inez Beverly Prosser (1895–1934) became the first African American woman to receive a PhD in psychology. Prosser’s research highlighted issues related to education in segregated versus integrated schools, and ultimately, her work was very influential in the landmark Brown v. Board of Education Supreme Court ruling that segregation of public schools was unconstitutional (Ethnicity and Health in America Series: Featured Psychologists, n.d.).

Although the establishment of psychology’s scientific roots occurred first in Europe and the United States, it did not take much time until researchers from around the world began to establish their own laboratories and research programs. For example, some of the first experimental psychology laboratories in South America were founded by Horatio Piñero (1869–1919) at two institutions in Buenos Aires, Argentina (Godoy & Brussino, 2010). In India, Gunamudian David Boaz (1908–1965) and Narendra Nath Sen Gupta (1889–1944) established the first independent departments of psychology at the University of Madras and the University of Calcutta, respectively. These developments provided an opportunity for Indian researchers to make important contributions to the field (Gunamudian David Boaz, n.d.; Narendra Nath Sen Gupta, n.d.).

When the American Psychological Association (APA) was first founded in 1892, all of the members were White males (Women and Minorities in Psychology, n.d.). However, by 1905, Mary Whiton Calkins was elected as the first female president of the APA, and by 1946, nearly one-quarter of American psychologists were female. Psychology became a popular degree option for students enrolled in the nation’s historically Black higher education institutions, increasing the number of Black Americans who went on to become psychologists. Given demographic shifts occurring in the United States and increased access to higher educational opportunities among historically underrepresented populations, there is reason to hope that the diversity of the field will increasingly match the larger population, and that the research contributions made by the psychologists of the future will better serve people of all backgrounds (Women and Minorities in Psychology, n.d.).

The Process of Scientific Research

Scientific knowledge is advanced through a process known as the scientific method . Basically, ideas (in the form of theories and hypotheses) are tested against the real world (in the form of empirical observations), and those empirical observations lead to more ideas that are tested against the real world, and so on. In this sense, the scientific process is circular. The types of reasoning within the circle are called deductive and inductive. In deductive reasoning , ideas are tested in the real world; in inductive reasoning , real-world observations lead to new ideas ( Figure 2.4 ). These processes are inseparable, like inhaling and exhaling, but different research approaches place different emphasis on the deductive and inductive aspects.

In the scientific context, deductive reasoning begins with a generalization—one hypothesis—that is then used to reach logical conclusions about the real world. If the hypothesis is correct, then the logical conclusions reached through deductive reasoning should also be correct. A deductive reasoning argument might go something like this: All living things require energy to survive (this would be your hypothesis). Ducks are living things. Therefore, ducks require energy to survive (logical conclusion). In this example, the hypothesis is correct; therefore, the conclusion is correct as well. Sometimes, however, an incorrect hypothesis may lead to a logical but incorrect conclusion. Consider this argument: all ducks are born with the ability to see. Quackers is a duck. Therefore, Quackers was born with the ability to see. The logic is sound, but if the premise that all ducks are born with the ability to see turns out to be false, the conclusion about Quackers may be false as well. Scientists use deductive reasoning to empirically test their hypotheses. Returning to the example of the ducks, researchers might design a study to test the hypothesis that if all living things require energy to survive, then ducks will be found to require energy to survive.

Deductive reasoning starts with a generalization that is tested against real-world observations; however, inductive reasoning moves in the opposite direction. Inductive reasoning uses empirical observations to construct broad generalizations. Unlike deductive reasoning, conclusions drawn from inductive reasoning may or may not be correct, regardless of the observations on which they are based. For instance, you may notice that your favorite fruits—apples, bananas, and oranges—all grow on trees; therefore, you assume that all fruit must grow on trees. This would be an example of inductive reasoning, and, clearly, the existence of strawberries, blueberries, and kiwi demonstrate that this generalization is not correct despite it being based on a number of direct observations. Scientists use inductive reasoning to formulate theories, which in turn generate hypotheses that are tested with deductive reasoning. In the end, science involves both deductive and inductive processes.

For example, case studies, which you will read about in the next section, are heavily weighted on the side of empirical observations. Thus, case studies are closely associated with inductive processes as researchers gather massive amounts of observations and seek interesting patterns (new ideas) in the data. Experimental research, on the other hand, puts great emphasis on deductive reasoning.

We’ve stated that theories and hypotheses are ideas, but what sort of ideas are they, exactly? A theory is a well-developed set of ideas that propose an explanation for observed phenomena. Theories are repeatedly checked against the world, but they tend to be too complex to be tested all at once; instead, researchers create hypotheses to test specific aspects of a theory.

A hypothesis is a testable prediction about how the world will behave if our idea is correct, and it is often worded as an if-then statement (e.g., if I study all night, I will get a passing grade on the test). The hypothesis is extremely important because it bridges the gap between the realm of ideas and the real world. As specific hypotheses are tested, theories are modified and refined to reflect and incorporate the results of these tests ( Figure 2.5 ).

To see how this process works, let’s consider a specific theory and a hypothesis that might be generated from that theory. As you’ll learn in a later chapter, the James-Lange theory of emotion asserts that emotional experience relies on the physiological arousal associated with the emotional state. If you walked out of your home and discovered a very aggressive snake waiting on your doorstep, your heart would begin to race and your stomach churn. According to the James-Lange theory, these physiological changes would result in your feeling of fear. A hypothesis that could be derived from this theory might be that a person who is unaware of the physiological arousal that the sight of the snake elicits will not feel fear.

A scientific hypothesis is also falsifiable , or capable of being shown to be incorrect. Recall from the introductory chapter that Sigmund Freud had lots of interesting ideas to explain various human behaviors ( Figure 2.6 ). However, a major criticism of Freud’s theories is that many of his ideas are not falsifiable; for example, it is impossible to imagine empirical observations that would disprove the existence of the id, the ego, and the superego—the three elements of personality described in Freud’s theories. Despite this, Freud’s theories are widely taught in introductory psychology texts because of their historical significance for personality psychology and psychotherapy, and these remain the root of all modern forms of therapy.

In contrast, the James-Lange theory does generate falsifiable hypotheses, such as the one described above. Some individuals who suffer significant injuries to their spinal columns are unable to feel the bodily changes that often accompany emotional experiences. Therefore, we could test the hypothesis by determining how emotional experiences differ between individuals who have the ability to detect these changes in their physiological arousal and those who do not. In fact, this research has been conducted and while the emotional experiences of people deprived of an awareness of their physiological arousal may be less intense, they still experience emotion (Chwalisz, Diener, & Gallagher, 1988).

Scientific research’s dependence on falsifiability allows for great confidence in the information that it produces. Typically, by the time information is accepted by the scientific community, it has been tested repeatedly.


© Jan 6, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


The psychology of experimental psychologists: Overcoming cognitive constraints to improve research: The 47th Sir Frederic Bartlett Lecture

Like many other areas of science, experimental psychology is affected by a “replication crisis” that is causing concern in many fields of research. Approaches to tackling this crisis include better training in statistical methods, greater transparency and openness, and changes to the incentives created by funding agencies, journals, and institutions. Here, I argue that if proposed solutions are to be effective, we also need to take into account human cognitive constraints that can distort all stages of the research process, including design and execution of experiments, analysis of data, and writing up findings for publication. I focus specifically on cognitive schemata in perception and memory, confirmation bias, systematic misunderstanding of statistics, and asymmetry in moral judgements of errors of commission and omission. Finally, I consider methods that may help mitigate the effect of cognitive constraints: better training, including use of simulations to overcome statistical misunderstanding; specific programmes directed at inoculating against cognitive biases; adoption of Registered Reports to encourage more critical reflection in planning studies; and using methods such as triangulation and “pre mortem” evaluation of study design to foster a culture of dialogue and criticism.


Introduction

The past decade has been a bruising one for experimental psychology. The publication of a paper by Simmons, Nelson, and Simonsohn (2011) entitled “False-positive psychology” drew attention to problems with the way in which research was often conducted in our field, which meant that many results could not be trusted. Simmons et al. focused on “undisclosed flexibility in data collection and analysis,” which is now variously referred to as p -hacking, data dredging, noise mining, or asterisk hunting: exploring datasets with different selections of variables and different analyses to attain a p -value lower than .05 and, subsequently, reporting only the significant findings. Hard on the heels of their demonstration came a wealth of empirical evidence from the Open Science Collaboration (2015) . This showed that less than half the results reported in reputable psychological journals could be replicated in a new experiment.
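A toy simulation, not taken from Simmons et al. but written here purely to illustrate their point, shows why this flexibility matters: when the null hypothesis is true for every outcome, testing several outcome variables and reporting whichever one reaches p < .05 yields “significant” findings far more often than the nominal 5%. All the numbers below are arbitrary.

    # False-positive inflation from testing multiple outcomes and reporting the best.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_simulations, n_outcomes, n_per_group = 2000, 5, 30

    false_positives = 0
    for _ in range(n_simulations):
        # Two groups drawn from the same population: the null is true everywhere.
        group_a = rng.normal(size=(n_per_group, n_outcomes))
        group_b = rng.normal(size=(n_per_group, n_outcomes))
        pvals = [stats.ttest_ind(group_a[:, j], group_b[:, j]).pvalue
                 for j in range(n_outcomes)]
        if min(pvals) < 0.05:              # report only the "best" outcome
            false_positives += 1

    print(false_positives / n_simulations)  # roughly 0.23 rather than 0.05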

The points made by Simmons et al. (2011) were not new: indeed, they were anticipated in 1830 by Charles Babbage, who described “cooking” of data:

This is an art of various forms, the object of which is to give ordinary observations the appearance and character of those of the highest degree of accuracy. One of its numerous processes is to make multitudes of observations, and out of these to select only those which agree, or very nearly agree. If a hundred observations are made, the cook must be very unhappy if he cannot pick out fifteen or twenty which will do for serving up. (p. 178–179)

P -hacking refers to biased selection of data or analyses from within an experiment. Bias also affects which studies get published in the form of publication bias—the tendency for positive results to be overrepresented in the published literature. This is problematic because it gives an impression that findings are more consistent than is the case, which means that false theories can attain a state of “canonisation,” where they are widely accepted as true ( Nissen, Magidson, Gross, & Bergstrom, 2016 ). Figure 1 illustrates this with a toy simulation of a set of studies testing a difference between means from two conditions. If we have results from a series of experiments, three of which found a statistically significant difference and three of which did not, this provides fairly strong evidence that the difference is real (panel a). However, if we add a further four experiments that were not reported because results were null, the evidence cumulates in the opposite direction. Thus, omission of null studies can drastically alter our impression of the overall support for a hypothesis.

Figure 1. The impact of publication bias demonstrated with plots of cumulative log odds in favour of true versus null effect over a series of experiments. The log odds for each experiment can be computed with knowledge of alpha (.05) and power (.8); 1 denotes an experiment with significant difference between means, and 0, a null result. The starting point is zero, indicating that we assume a 50:50 chance of a true effect. For each significant result, the log odds of it coming from a true effect versus a null effect is log(.8/.05) = 2.77. For a null result, the log odds is log(.2/.95) = −1.55. The selected set of studies in panel (a) concludes with a log odds greater than 3, indicating that the likelihood of a true effect is 20 times greater than a null effect. However, panel (b), which includes additional null results (labelled in grey), leads to the opposite conclusion.
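The toy calculation behind Figure 1 can be reproduced in a few lines of Python; the two sets of study outcomes below are the hypothetical ones described in the caption (three significant and three null results, with four further unreported null results added in panel b), and natural logarithms are assumed.

    # Cumulative log odds of a true versus null effect, given alpha = .05 and power = .8.
    import math

    alpha, power = 0.05, 0.80
    lr_sig = math.log(power / alpha)                 # about +2.77 per significant result
    lr_null = math.log((1 - power) / (1 - alpha))    # about -1.56 per null result

    def cumulative_log_odds(results):                # 1 = significant, 0 = null
        return sum(lr_sig if r == 1 else lr_null for r in results)

    published_only = [1, 0, 1, 0, 1, 0]              # panel (a): 3 hits, 3 nulls
    full_record = published_only + [0, 0, 0, 0]      # panel (b): plus 4 unreported nulls

    print(round(cumulative_log_odds(published_only), 2))   # about 3.64: favours a true effect
    print(round(cumulative_log_odds(full_record), 2))      # about -2.59: favours a null effect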

Since the paper by Simmons et al. (2011) , there has been a dramatic increase in replication studies. As a result, a number of well-established phenomena in psychology have come into question. Often it is difficult to be certain whether the original reports were false positives, whether the replication was flawed, or whether the effect of interest is only evident under specific conditions—see, for example, Hobson and Bishop (2016) on mu suppression in response to observed actions; Sripada, Kesller, and Jonides (2016) on ego depletion; Lehtonen et al. (2018) on an advantage in cognitive control for bilinguals; O’Donnell et al. (2018) on the professor-priming effect; and Oostenbroek et al. (2016) on neonatal imitation. What is clear is that the size, robustness, and generalisability of many classic effects are lower than previously thought.

Selective reporting, through p -hacking and publication bias, is not the only blight on our science. A related problem is many editors place emphasis on reporting results in a way that “tells a good story,” even if that means retrofitting our hypothesis to the data, i.e., HARKing or “hypothesising after the results are known” ( Kerr, 1998 ). Oberauer and Lewandowsky (2019) drew parallels between HARKing and p -hacking: in HARKing, there is post hoc selection of hypotheses, rather than selection of results or an analytic method. They proposed that HARKing is most widely used in fields where theories are so underspecified that they can accommodate many hypotheses and where there is a lack of “disconfirmatory diagnosticity,” i.e., failure to support a prediction is uninformative.

A lack of statistical power is a further problem for psychology—one that has been recognised since 1969 , when Jacob Cohen exhorted psychologists not to waste time and effort doing experiments that had too few observations to show an effect of interest. In other fields, notably clinical trials and genetics, after a period where non-replicable results proliferated, underpowered studies died out quite rapidly when journals adopted stringent criteria for publication (e.g., Johnston, Lahey, & Matthys, 2013 ), and funders began to require power analysis in grant proposals. Psychology, however, has been slow to catch up.
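Cohen’s point is easy to verify with a simulation-based power check. The sketch below, with an assumed medium effect size (Cohen’s d = 0.5) and arbitrary sample sizes, estimates the probability that a standard two-group t test at alpha = .05 will detect the effect.

    # Simulation-based estimate of statistical power for a two-group t test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def estimated_power(n_per_group, d=0.5, n_simulations=5000, alpha=0.05):
        hits = 0
        for _ in range(n_simulations):
            control = rng.normal(0.0, 1.0, n_per_group)
            treatment = rng.normal(d, 1.0, n_per_group)   # true effect of size d
            if stats.ttest_ind(treatment, control).pvalue < alpha:
                hits += 1
        return hits / n_simulations

    print(estimated_power(20))   # roughly 0.33: badly underpowered
    print(estimated_power(64))   # roughly 0.80: the conventional target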

It is not just experimental psychology that has these problems—studies attempting to link psychological traits and disorders to genetic and/or neurobiological variables are, if anything, subject to greater challenges. A striking example comes from a meta-analysis of links between the serotonin transporter gene, 5-HTTLPR, and depression. This postulated association has attracted huge research interest over the past 20 years, and the meta-analysis included 450 studies. Contrary to expectation, it concluded that there was no evidence of association. In a blog post summarising findings, Alexander (2019) wrote,

. . . what bothers me isn’t just that people said 5-HTTLPR mattered and it didn’t. It’s that we built whole imaginary edifices, whole castles in the air on top of this idea of 5-HTTLPR mattering. We “figured out” how 5-HTTLPR exerted its effects, what parts of the brain it was active in, what sorts of things it interacted with, how its effects were enhanced or suppressed by the effects of other imaginary depression genes. This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.

It is no exaggeration to say that our field is at a crossroads ( Pashler & Wagenmakers, 2012 ), and the 5-HTTLPR story is just a warning sign that practices that lead to bad science are widespread. If we continue to take the well-trodden path, using traditional methods for cooking data and asterisk hunting, we are in danger of losing attention, respect, and funding.

Much has been written about how we might tackle the so-called “replication crisis.” There have been four lines of attack. First, there have been calls for greater openness and transparency ( Nosek et al., 2015 ). Second, a case has been made for better training in methods (e.g., Rousselet, Pernet, & Wilcox, 2017 ). Third, it has been argued we need to change the way research has been conducted to incorporate pre-registration of research protocols, preferably in the format of Registered Reports, which are peer-reviewed prior to data collection ( Chambers, 2019 ). Fourth, it is recognised that for too long, the incentive structure of research has prioritised innovative, groundbreaking results over methodological quality. Indeed, Smaldino and McElreath (2016) suggested that one can model the success of scientists in a field as an evolutionary process, where prestigious publications lead to survival, leaving those whose work is less exciting to wither away and leave science. The common thread to these efforts is that they locate the mechanisms of bad science at the systemic level, in ways in which cultures and institutions reinforce norms and distribute resources. The solutions are, therefore, aimed at correcting these shortcomings by creating systems that make good behaviour easier and more rewarding and make poor behaviour more costly.

My view, however, is that institutional shortcomings are only part of the story: to improve scientific research, we also need to understand the mechanisms that maintain bad practices in individual humans. Bad science is usually done because somebody mistook it for good science. Understanding why individual scientists mistake bad science for good, and helping them to resist these errors, is a necessary component of the movement to improve psychology. I will argue that we need to understand how cognitive constraints lead to faulty reasoning if we are to get science back on course and persuade those who set the incentives to reform. Fortunately, as psychologists, we are uniquely well positioned to tackle this issue.

Experimental psychology has a rich tradition of studying human reasoning and decision-making, documenting the flaws and foibles that lead us to selectively process some types of information, make judgements on the basis of incomplete evidence, and sometimes behave in ways that seem frankly irrational. This line of work has had significant application to economics, politics, business studies, and law, but, with some notable exceptions (e.g., Hossenfelder, 2018 ; Mahoney, 1976 ), it has seldom been considered when studying the behaviour of research scientists. In what follows, I consider how our knowledge of human cognition can make sense of problematic scientific practices, and I propose ways we might use this information to find solutions.

Cognitive constraints that affect how psychological science is done

Table 1 lists four characteristics of human cognition that I focus on: I refer to these as “constraints” because they limit how we process, understand, or remember information, but it is important to note that they include some biases that can be beneficial in many contexts. The first constraint is confirmation bias. As Hahn and Harris (2014) noted, a range of definitions of “confirmation bias” exist—here, I will define it as the tendency to seek out evidence that supports our position. A further set of constraints has to do with understanding of probability. A lack of an intuitive grasp of probability contributes to both neglect of statistical power in study design and p -hacking in data analysis. Third, there is an asymmetry in moral reasoning that can lead us to treat errors of omission as less culpable than errors of commission, even when their consequences are equally serious ( Haidt & Baron, 1996 ). The final constraint featured in Bartlett’s (1932) work: reliance on cognitive schemata to fill in unstated information, leading to “reconstructive remembering,” which imbues memories with meaning while filtering out details that do not fit preconceptions.

Table 1. Different types of cognitive constraints.

| Cognitive constraint | Description |
| --- | --- |
| Confirmation bias | Tendency to seek out and remember evidence that supports a preferred viewpoint |
| Misunderstanding of probability | (a) Failure to understand how estimation scales with sample size; (b) failure to understand that probability depends on context |
| Asymmetric moral reasoning | Errors of omission judged less seriously than errors of commission |
| Reliance on schemata | Perceiving and/or remembering in line with pre-existing knowledge, leading to omission or distortion of information that does not fit the schema |

In what follows, I illustrate how these constraints assume particular importance at different stages of the research process, as shown in Table 2 .

Table 2. Cognitive constraints that operate at different stages of the research process.

| Stage of research | Cognitive constraint |
| --- | --- |
| Experimental design | Confirmation bias: looking for evidence consistent with theory; statistical misunderstanding: power |
| Data analysis | Statistical misunderstanding: p-hacking; moral asymmetry: omission and “paltering” deemed acceptable |
| Scientific reporting | Confirmation bias in reviewing literature; moral asymmetry: omission and “paltering” deemed acceptable; cognitive schemata: need for narrative, HARKing |

HARKing: hypothesising after the results are known.

Bias in experimental design

Confirmation bias and the failure to consider alternative explanations.

Scientific discovery involves several phases: the researcher needs to (a) assemble evidence, (b) look for meaningful patterns and regularities in the data, (c) formulate a hypothesis, and (d) test it empirically by gathering informative new data. Steps (a)–(c) may be designated as exploratory and step (d) as hypothesis testing or confirmatory (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Importantly, the same experiment cannot be used to both formulate and confirm a hypothesis. In practice, however, the distinction between the two types of experiment is often blurred.

Our ability to see patterns in data is vital at the exploratory stage of research: indeed, seeing something that nobody else has observed is a pinnacle of scientific achievement. Nevertheless, new ideas are often slow to be accepted, precisely because they do not fit the views of the time. One such example is described by Zilles and Amunts (2010) : Brodmann’s cytoarchitectonic map of the brain, described in 1909. This has stood the test of time and is still used over 100 years later, but for several decades, it was questioned by those who could not see the fine distinctions made by Brodmann. Indeed, criticisms of poor reproducibility and lack of objectivity were levelled against him.

Brodmann’s case illustrates that we need to be cautious about dismissing findings that depend on special expertise or unique insight of the observer. However, there are plenty of other instances in the history of science where invalid ideas persisted, especially if proposed by an influential or charismatic figure. Entire edifices of pseudoscience have endured because we are very bad at discarding theories that do not work; as Bartlett (1932) would predict, new information that is consistent with the theory will strengthen its representation in our minds, but inconsistent information will be ignored. Examples from the history of science include the rete mirabile, a mass of intertwined arteries that is found in sheep but was wrongly included in anatomical drawings of humans for over 1,000 years because of the significance attributed to this structure by Galen (Bataille et al., 2007); the planet Vulcan, predicted by Newton’s laws and seen by many astronomers until its existence was disproved by Einstein’s discoveries (Levenson, 2015); and N-rays, non-existent rays seen by at least 40 people and analysed in some 300 papers by 100 scientists between 1903 and 1906 (Nye, 1980).

Popper’s (1934/ 1959 ) goal was to find ways to distinguish science from pseudoscience, and his contribution to philosophy of science was to emphasise that we should be bold in developing ideas but ruthless in attempts to falsify them. In an early attempt to test scientists’ grasp of Popperian logic, Mahoney (1976) administered a classic task developed by Wason (1960) to 84 scientists (physicists, biologists, psychologists, and sociologists). In this deceptively simple task, people are shown four cards and told that each card has a number on one side and a patch of colour on the other side. The cards are placed to show number 3, number 8, red, and blue, respectively (see Figure 2 ). The task is to identify which cards need to be turned over to test the hypothesis that if an even number appears on one side, then the opposite side is red. The subject can pick any number of cards. The correct response is to name the two cards that could disconfirm the hypothesis—the number 8 and the blue card. Fewer than 10% of the scientists tested by Mahoney identified both critical cards, more often selecting the number 8 and the red card.


Figure 2. Wason’s (1960) task: The subject is told, “Each card has a number on one side and a patch of colour on the other. You are asked to test the hypothesis that—for these 4 cards—if an even number appears on one side, then the opposite side is red. Which card(s) would you turn over to test the hypothesis?”

Although this study was taken as evidence of unscientific reasoning by scientists, that conclusion has since been challenged by those who have criticised both Popperian logic, in general, and the Wason selection task, in particular, as providing an unrealistic test of human rationality. For a start, the Wason task uses a deterministic hypothesis that can be disproved by a single piece of evidence. This is not a realistic model of biological or behavioural sciences, where we seldom deal with deterministic phenomena. Consider the claim that smoking causes lung cancer. Most of us accept that this is so, even though we know there are people who smoke and who do not get lung cancer and people who get lung cancer but never smoked. When dealing with probabilistic phenomena, a Bayesian approach makes more sense, whereby we consider the accumulated evidence to determine the relative likelihood of one hypothesis over another (as illustrated in Figure 1 ). Theories are judged as more or less probable, rather than true or false. Oaksford and Chater (1994) showed that, from a Bayesian perspective, typical selections made on the Wason task would be rational in contexts where the antecedent and consequent of the hypothesis (an even number and red colour) were both rare. Subsequently, Perfors and Navarro (2009) concluded that in situations where rules are relevant only for a minority of entities, then confirmation bias is an efficient strategy.

This kind of analysis has shifted the focus to discussions about how far, and under what circumstances, people are rational decision-makers. However, it misses a key point about scientific reasoning, which is that it involves an active process of deciding which evidence to gather, rather than merely a passive evaluation of existing evidence. It seems reasonable to conclude that, when presented with a particular set of evidence, people generally make decisions that are rational when evaluated against Bayesian standards. However, history suggests that we are less good at identifying which new evidence needs to be gathered to evaluate a theory. In particular, people appear to have a tendency to accept a hypothesis on the basis of “good enough” evidence, rather than actively seeking evidence for alternative explanations. Indeed, an early study by Doherty, Mynatt, Tweney, and Schiavo (1979) found that, when given an opportunity to select evidence to help decide which of two hypotheses was true (in a task where a fictitious pot had to be assigned as originating from one of the two islands that differed in characteristic features), people seemed unable to identify which information would be diagnostic and tended, instead, to select information that could neither confirm nor disconfirm their hypothesis.

Perhaps the strongest evidence for our poor ability to consider alternative explanations comes from the history of the development of clinical trials. Although James Lind is credited with doing the first trials for treatment of scurvy in 1747, it was only in 1948 that the randomised controlled trial became the gold standard for evaluating medical interventions ( Vallier & Timmerman, 2008 ). The need for controls is not obvious, and people who are not trained in this methodology will often judge whether a treatment is effective on the basis of a comparison on an outcome measure between a pre-treatment baseline and a post-treatment evaluation. The logic is that if a group of patients given the treatment does not improve, the treatment did not work. If they do show meaningful gains, then it did work. And we can even embellish this comparison with a test of statistical significance. This reasoning can be seen as entirely rational, and this can explain why so many people are willing to accept that alternative medicine is effective.

The problem with this approach is that the pre–post intervention comparison allows important confounds to creep in. For instance, early years practitioners argue that we should identify language problems in toddlers so that we can intervene early. They find that if 18-month-old late talkers are given intervention, only a minority still have language problems at 2 years and, therefore, conclude the intervention was effective. However, if an untreated control group is studied over the same period, we find very similar rates of improvement (Wake et al., 2011)—presumably due to factors such as spontaneous resolution of problems or regression to the mean, which will lead to systematic bias in outcomes. Researchers need training to recognise causes of bias and to take steps to overcome them: thinking about possible alternative explanations of an observed phenomenon does not come naturally, especially when the preliminary evidence looks strong.

Intervention studies provide the clearest evidence of what I term “premature entrenchment” of a theory: some other examples are summarised in Table 3 . Note that these examples do not involve poor replicability, quite the opposite. They are all cases where an effect, typically an association between variables, is reliably observed, and researchers then converge on accepting the most obvious causal explanation, without considering lines of evidence that might point to alternative possibilities.

Table 3. Premature entrenchment: examples where the most obvious explanation for an observed association is accepted for many years, without considering alternative explanations that could be tested with different evidence.

| Observation | Favoured explanation | Alternative explanation | Evidence for alternative explanation |
| --- | --- | --- | --- |
| Home literacy environment predicts reading outcomes in children | Access to books at home affects children’s learning to read | Parents and children share genetic risk for reading problems | Children who are poor readers tend to have parents who are poor readers |
| Speech sounds (phonemes) do not have consistent auditory correlates but can be identified by knowledge of articulatory configurations used to produce them | Motor theory of speech perception: we learn to recognise speech by mapping input to articulatory gestures | Correlations between perception and production reflect co-occurrence rather than causation | Children who are congenitally unable to speak can develop good speech perception, despite having no articulatory experience |
| Dyslexics have atypical brain responses to speech when assessed using fMRI | Atypical brain organisation provides evidence that dyslexia is a “real disorder” with a neurobiological basis | Atypical responses to speech in the brain are a consequence of being a poor reader | Adults who had never been taught to read have atypical brain organisation for spoken language |

fMRI: functional magnetic resonance imaging.

Premature entrenchment may be regarded as evidence that humans adopt Bayesian reasoning: we form a prior belief about what is the case and then require considerably more evidence to overturn that belief than to support it. This would explain why, when presented with virtually identical studies that either provided support for or evidence against astrology, psychologists were more critical of the latter ( Goodstein & Brazis, 1970 ). The authors of that study expressed concern about the “double standard” shown by biased psychologists who made unusually harsh demands of research in borderline areas, but from a Bayesian perspective, it is reasonable to use prior knowledge so that extraordinary claims require extraordinary evidence. Bayesian reasoning is useful in many situations: it allows us to act decisively on the basis of our long-term experience, rather than being swayed by each new incoming piece of data. However, it can be disastrous if we converge on a solution too readily on the basis of incomplete or inaccurate information. This will be exacerbated by publication bias, which distorts the evidential landscape.

For many years, the only methods available to counteract the tendency for premature entrenchment were exhortations to be self-critical (e.g., Feynman, 1974 ) and peer review. The problem with peer review is that it typically comes too late to be useful, after research is completed. In the final section of this article, I will consider some alternative approaches that bring in external appraisal of experimental designs at an earlier stage in the research process.

Misunderstanding of probability leading to underpowered studies

Some 17 years after Cohen’s seminal work on statistical power, Newcombe (1987) wrote,

Small studies continue to be carried out with little more than a blind hope of showing the desired effect. Nevertheless, papers based on such work are submitted for publication, especially if the results turn out to be statistically significant. (p. 657)

In clinical medicine, things have changed, and the importance of adequate statistical power is widely recognised among those conducting clinical trials. But in psychology, the “blind hope” has persisted, and we have to ask ourselves why this is.

My evidence here is anecdotal, but the impression is that many psychologists simply do not believe advice about statistical power, perhaps because there are so many underpowered studies published in the literature. When a statistician is consulted about sample size for a study, he or she will ask the researcher to estimate the anticipated effect size. This usually leads to a sample size estimate that is far higher than the researcher anticipated or finds feasible, leading to a series of responses not unlike the first four of the five stages of grief: denial, anger, bargaining, and depression. The final stage, acceptance, may, however, not be reached.

Of course, there are situations where small sample sizes are perfectly adequate: the key issue is how large the effect of interest is in relation to the variance. In some fields, such as psychophysics, you may not even need statistics—the famous “interocular trauma” test (referring to a result so obvious and clear-cut that it hits you between the eyes) may suffice. Indeed, in such cases, recruitment of a large sample would just be wasteful.

There are, however, numerous instances in psychology where people have habitually used sample sizes that are too small to reliably detect an effect of interest: see, for instance, the analysis by Poldrack et al. (2017) of well-known effects in functional magnetic resonance imaging (fMRI) or Oakes (2017) on looking-time experiments in infants. Quite often, a line of research is started when a large effect is seen in a small sample, but over time, it becomes clear that this is a case of “winner’s curse,” a false positive that is published precisely because it looks impressive but then fails to replicate when much larger sample sizes are used. There are some recent examples from studies looking at neurobiological or genetic correlates of individual differences, where large-scale studies have failed to support previously published associations that had appeared to be solid (e.g., De Kovel & Francks, 2019 , on genetics of handedness; Traut et al., 2018 , on cerebellar volume in autism; Uddén et al., 2019 , on genetic correlates of fMRI language-based activation).

A clue to the persistence of underpowered psychology studies comes from early work by Tversky and Kahneman (1971 , 1974 ). They studied a phenomenon that they termed “belief in the law of small numbers,” an exaggerated confidence in the validity of conclusions based on small samples, and showed that even those with science training tended to have strong intuitions about random sampling that were simply wrong. They illustrated this with the following problem:

A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50% of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50%, sometimes lower. For a period of 1 year, each hospital recorded the days on which more than 60% of the babies born were boys. Which hospital do you think recorded more such days?

1. The large hospital
2. The small hospital
3. About the same (that is, within 5% of each other)

Most people selected Option 3, whereas, as illustrated in Figure 3, Option 2 is the correct answer—with only 15 births per day, the day-to-day variation in the proportion of boys will be much higher than with 45 births per day, and hence, more days will have more than 60% boys. One reason why our intuitions deceive us is that the sample size does not affect the average percentage of male births in the long run: this will be 50%, regardless of the hospital size. But sample size has a dramatic impact on the variability in the proportion of male births from day to day. More formally, if you have a large and a small sample drawn from the same population, the expected estimate of the mean will be the same, but the standard error of that estimate will be greater for the small sample.


Figure 3. Simulated data showing proportions of males born in a small hospital with 15 births per day versus a large hospital with 45 births per day. The small hospital has more days where more than 60% of births are boys (points above red line).
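For readers who want to check the intuition behind the figure, here is a minimal simulation sketch in Python; the binomial model of births and the 365-day year are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
days = 365

# Daily proportion of boys in each hospital, assuming each birth is a fair coin flip
small = rng.binomial(n=15, p=0.5, size=days) / 15
large = rng.binomial(n=45, p=0.5, size=days) / 45

print("Small hospital, days with >60% boys:", np.sum(small > 0.6))
print("Large hospital, days with >60% boys:", np.sum(large > 0.6))
# The small hospital records many more such days: its daily proportions
# vary more widely around .5 because each day's sample is smaller.
```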

Statistical power depends on the effect size, which, for a simple comparison of two means, can be computed as the difference in means divided by the pooled standard deviation. It follows that power is crucially dependent on the proportion of variance in observations that is associated with an effect of interest, relative to background noise. Where variance is high, it is much harder to detect the effect, and hence, small samples are often underpowered. Increasing the sample size is not the only way to improve power: other options include improving the precision of measurement, using more effective manipulations, or adopting statistical approaches to control noise ( Lazic, 2018 ). But in many situations, increasing the sample size is the preferred approach to enhance statistical power to detect an effect.
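To illustrate how power scales with sample size, the sketch below estimates power by simulation for a simple two-group comparison; the assumed true effect size (d = 0.3) and the other settings are illustrative choices, not values taken from the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(n_per_group, d=0.3, alpha=0.05, n_sims=5000):
    """Estimate power of a two-sample t test for a standardised effect size d."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)   # control group
        b = rng.normal(d, 1.0, n_per_group)     # experimental group, shifted by d SDs
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (20, 50, 100, 200):
    print(n, round(simulated_power(n), 2))
# With d = 0.3, 20 per group gives power well below .5;
# roughly 175 per group are needed to reach the conventional .8.
```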

Bias in data analysis: p-hacking

P-hacking can take various forms, but they all involve a process of selective analysis. Suppose some researchers hypothesise that there is an association between executive function and implicit learning in a serial reaction time task, and they test this in a study using four measures of executive function. Even if there is only one established way of scoring each task, they have four correlations; this means that the probability that none of the correlations is significant at the .05 level is .95^4 (i.e., .815), and conversely, the probability that at least one is significant is .185. This probability can be massaged to even higher levels if the experimenters look at the data and then select an analytic approach that maximises the association: maybe by dropping outliers, by creating a new scoring method, combining measures in composites, and so on. Alternatively, the experimenters may notice that the strength of the correlation varies with the age or sex of participants and so subdivide the sample to coax at least a subset of data into significance. The key thing about p-hacking is that at the end of the process, the researchers selectively report the result that “worked,” with the implication that the p-value can be interpreted at face value. But it cannot: probability is meaningless if not defined in terms of a particular analytic context. P-hacking appears to be common in psychology (John, Loewenstein, & Prelec, 2012). I argue here that this is because it arises from a conjunction of two cognitive constraints: failure to understand probability, coupled with a view that omission of information when reporting results is not a serious misdemeanour.
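The inflation described here is easy to verify by simulation. A minimal sketch, assuming 20 participants (as in the worked example below), four executive-function measures, and no true association anywhere, estimates how often at least one of the four correlations comes out “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_participants, n_measures, n_sims = 20, 4, 10000

at_least_one = 0
for _ in range(n_sims):
    implicit = rng.normal(size=n_participants)
    measures = rng.normal(size=(n_measures, n_participants))  # pure noise
    pvals = []
    for m in measures:
        _, pval = stats.pearsonr(implicit, m)
        pvals.append(pval)
    if min(pvals) < 0.05:
        at_least_one += 1

print(at_least_one / n_sims)  # close to 1 - .95**4 = .185, not .05
```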

Failure to understand probability

In an influential career guide published by the American Psychological Association, Bem (2004) explicitly recommended going against the “conventional view” of the research process, as this might lead us to miss exciting new findings. Instead, readers were encouraged to

become intimately familiar with . . . the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find additional evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something—anything—interesting. (p. 2)

For those who were concerned this might be inappropriate, Bem offered reassurance. Everything is fine because what you are doing is exploring your data. Indeed, he implied that anyone who follows the “conventional view” would be destined to do boring research that nobody will want to publish.

Of course, Bem (2004) was correct to say that we need exploratory research. The problem comes when exploratory research is repackaged as if it were hypothesis testing, with the hypothesis invented after observing the data (HARKing), and the paper embellished with p -values that are bound to be misleading because they were p -hacked from numerous possible values, rather than derived from testing an a priori hypothesis. If results from exploratory studies were routinely replicated, prior to publication, we would not have a problem, but they are not. So why did the American Psychological Association think it appropriate to publish Bem’s views as advice to young researchers? We can find some clues in the book overview, which explains that there is a distinction between the “formal” rules that students are taught and the “implicit” rules that are applied in everyday life, concluding that “This book provides invaluable guidance that will help new academics plan, play, and ultimately win the academic career game.” Note that the stated goal is not to do excellent research: it is to have “a lasting and vibrant career.” It seems, then, that there is recognition here that if you do things in the “conventional” way, your career will suffer. It is clear from Bem’s framing of his argument that he was aware that his advice was not “conventional,” but he did not think it was unethical—indeed, he implied it would be unfair on young researchers to do things conventionally as that will prevent them making exciting discoveries that will enable them to get published and rise up the academic hierarchy. While it is tempting to lament the corruption of a system that treats an academic career as a game, it is more important to consider why so many people genuinely believe that p -hacking is a valid, and indeed creative, approach to doing research.

The use of null-hypothesis significance testing has attracted a lot of criticism, with repeated suggestions over the years that p-values be banned. I favour the more nuanced view expressed by Lakens (2019), who suggests that p-values have a place in science, provided they are correctly understood and used to address specific questions. There is no doubt, however, that many people do misunderstand the p-value. There are many varieties of misunderstanding, but perhaps the most common is to interpret the p-value as a measure of strength of evidence that can be attached to a given result, regardless of the context. It is easy to see how this misunderstanding arises: if we hold the sample size constant, then for a single comparison, there is a direct, monotonic relationship between the p-value and the effect size (the larger the observed effect, the smaller the p-value). However, whereas an effect size remains the same, regardless of the analytic context, a p-value is crucially context-dependent.

Suppose in the fictitious study of executive function described above, the researchers have 20 participants and four measures of executive function (A–D) that correlate with implicit learning with r values of .21, .47, .07, and −.01. The statistics package tells us that the corresponding two-tailed p -values are .374, .037, .769, and .966. A naive researcher may rejoice at having achieved significance with the second correlation. However, as noted above, the probability that at least one correlation of the four will have an associated p -value of less than .05 is 18%, not 5%. If we want to identify correlations that are unlikely under the null hypothesis, then we need to correct the alpha level (e.g., by doing a Bonferroni correction to adjust by the number of tests, i.e., .05/4 = .0125). At this point, the researchers see their significant result snatched from their grasp. This creates a strong temptation to just drop the three non-significant tests and not report them. Alternatively, one sometimes sees papers that report the original p -value but then state that it “did not survive” Bonferroni correction, but they, nevertheless, exhume it and interpret the uncorrected value. Researchers acting this way may not think that they are doing anything inappropriate, other than going against advice of pedantic statisticians, especially given Bem’s (2004) advice to follow the “implicit” rather than “formal” rules of research. However, this is simply wrong: as illustrated above, a p -value can only be interpreted in relation to the context in which it is computed.
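The p-values quoted for this example can be recovered from the correlations via the standard t transformation, and the Bonferroni-corrected threshold applied in one step; a brief sketch (using scipy):

```python
import numpy as np
from scipy import stats

n = 20
r = np.array([0.21, 0.47, 0.07, -0.01])    # correlations with implicit learning

t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2) # t statistic for each correlation
p = 2 * stats.t.sf(np.abs(t), df=n - 2)    # two-tailed p-values
print(np.round(p, 3))                      # matches the values quoted in the text

alpha_corrected = 0.05 / len(r)            # Bonferroni: .05/4 = .0125
print(p < alpha_corrected)                 # none of the four survives correction
```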

One way of explaining the notion of p -hacking is to use the old-fashioned method of games of chance. I find this scenario helpful: we have a magician who claims he can use supernatural powers to deal a poker hand of “three of a kind” from an unbiased deck of cards. This type of hand will occur in around 1 of 50 draws from an unbiased deck. He points you to a man who, to his amazement, finds that his hand contains three of a kind. However, you then discover he actually tried his stunt with 50 people, and this man was the only one who got three of a kind. You are rightly disgruntled. This is analogous to p -hacking. The three-of-a-kind hand is real enough, but its unusualness, and hence its value as evidence of the supernatural, depends on the context of how many tests were done. The probability that needs to be computed here is not the probability of one specific result but rather the probability that specific result would come up at least once in 50 trials.
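To put numbers on the card example: if the chance of dealing three of a kind is roughly 1 in 50 (p ≈ .02), then the probability that at least one of 50 spectators receives such a hand is 1 − (1 − .02)^50 ≈ .64, so the “amazing” outcome is in fact more likely than not. As with statistical tests, the number of attempts is part of the context that gives the result its meaning.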

Asymmetry of sins of omission and commission

According to Greenwald (1975), “[I]t is a truly gross ethical violation for a researcher to suppress reporting of difficult-to-explain or embarrassing data to present a neat and attractive package to a journal editor” (p. 19).

However, this view is not universal.

Greenwald’s focus was on publication bias, i.e., failure to report an entire study, but the point he made about “prejudice” against null results also applies to cases of p -hacking where only “significant” results are reported, whereas others go unmentioned. It is easy to see why scientists might play down the inappropriateness of p -hacking, when it is so important to generate “significant” findings in a world with a strong prejudice against null results. But I suspect another reason why people tend to underrate the seriousness of p -hacking is because it involves an error of omission (failing to report the full context of a p -value), rather than an error of commission (making up data).

In studies of morality judgement, errors of omission are generally regarded as less culpable than errors of commission (see, e.g., Haidt & Baron, 1996 ). Furthermore, p -hacking may be seen to involve a particularly subtle kind of dishonesty because the statistics and their associated p -values are provided by the output of statistics software. They are mathematically correct when testing a specific, prespecified hypothesis: the problem is that, without the appropriate context, they imply stronger evidence than is justified. This is akin to what Rogers, Zeckhauser, Gino, Norton, and Schweitzer (2017) have termed “paltering,” i.e., the use of truthful statements to mislead, a topic they studied in the context of negotiations. An example was given of a person trying to sell a car that had twice needed a mechanic to fix it. Suppose the potential purchaser directly asks “Has the car ever had problems?” An error of commission is to deny the problems, but a paltering answer would be “This car drives very smoothly and is very responsive. Just last week it started up with no problems when the temperature was −5 degrees Fahrenheit.” Rogers et al. showed that negotiators were more willing to palter than to lie, although potential purchasers regarded paltering as only marginally less immoral than lying.

Regardless of the habitual behaviour of researchers, the general public does not find p -hacking acceptable. Pickett and Roche (2018) did an M-Turk experiment in which a community sample was asked to judge the morality of various scenarios, including this one:

A medical researcher is writing an article testing a new drug for high blood pressure. When she analyzes the data with either method A or B, the drug has zero effect on blood pressure, but when she uses method C, the drug seems to reduce blood pressure. She only reports the results of method C, which are the results that she wants to see.

Seventy-one percent of respondents thought this behaviour was immoral, 73% thought the researcher should receive a funding ban, and 63% thought the researcher should be fired.

Nevertheless, although selective reporting was generally deemed immoral, data fabrication was judged more harshly, confirming that in this context, as in those studied by Haidt and Baron (1996) , sins of commission are taken more seriously than errors of omission.

If we look at the consequences of a specific act of p -hacking, it can potentially be more serious than an act of data fabrication: this is most obvious in medical contexts, where suppression of trial results, either by omitting findings from within a study or by failing to publish studies with null results, can provide a badly distorted basis for clinical decision-making. In their simulations of evidence cumulation, Nissen et al. (2016) showed how p -hacking could compound the impact of publication bias and accelerate the premature “canonization” of theories; the alpha level that researchers assume applies to experimental results is distorted by p -hacking, and the expected rate of false positives is actually much higher. Furthermore, p -hacking is virtually undetectable because the data that are presented are real, but the necessary context for their interpretation is missing. This makes it harder to correct the scientific record.

Bias in writing up a study

Most writing on the “replication crisis” focuses on how studies are designed, how data are collected and analysed, and how results are reported. The résumé of the literature that is found in the introduction to empirical papers, as well as in literature review articles, is given less scrutiny. I will make the case that biased literature reviews are universal and have a major role in sustaining poor reproducibility because they lead to entrenchment of false theories, which are then used as the basis for further research.

It is common to see biased literature reviews that put a disproportionate focus on findings that are consistent with the author’s position. Researchers who know an area well may be aware of this, especially if their own work is omitted, but in general, cherry-picking of evidence is hard to detect. I will use a specific paper published in 2013 to illustrate my point, but I will not name the authors, as it would be invidious to single them out when the kinds of bias in their literature review are ubiquitous. In their paper, my attention was drawn to the following statement in the introduction:

Regardless of etiology, cerebellar neuropathology commonly occurs in autistic individuals. Cerebellar hypoplasia and reduced cerebellar Purkinje cell numbers are the most consistent neuropathologies linked to autism. … MRI studies report that autistic children have smaller cerebellar vermal volume in comparison to typically developing children.

I was surprised to read this because, a few years ago, I had attended a meeting on neuroanatomical studies of autism and had come away with the impression that there were few consistent findings. I did a quick search for an up-to-date review, which turned up a meta-analysis (Traut et al., 2018) that included 16 MRI studies published between 1997 and 2010, five of which reported larger cerebellar size in autism and one of which found smaller cerebellar size. In the article I was reading, one paper had been cited to support the MRI statement, but it referred to a study in which the absolute size of the vermis did not differ from that of typically developing children and was relatively small in the autistic participants only after the overall (larger) size of the cerebellum had been controlled for.

Other papers cited to support the claims of cerebellar neuropathology included a couple of early post mortem neuroanatomical studies, as well as two reviews. The first of these ( DiCicco-Bloom et al., 2006 ) summarised presentations from a conference and supported the claims made by the authors. The other one, however ( Palmen, van Engeland, Hof, & Schmitz, 2004 ), expressed more uncertainty and noted a lack of correspondence between early neuroanatomical studies and subsequent MRI findings, concluding,

Although some consistent results emerge, the majority of the neuropathological data remain equivocal. This may be due to lack of statistical power, resulting from small sample sizes and from the heterogeneity of the disorder itself, to the inability to control for potential confounding variables such as gender, mental retardation, epilepsy and medication status, and, importantly, to the lack of consistent design in histopathological quantitative studies of autism published to date.

In sum, a confident statement “cerebellar neuropathology commonly occurs in autistic individuals,” accompanied by a set of references, converged to give the impression that there is consensus that the cerebellum is involved in autism. However, when we drill down, we find that the evidence is uncertain, with discrepancies between neuropathological studies and MRI and methodological concerns about the former. Meanwhile, this study forms part of a large body of research in which genetically modified mice with cerebellar dysfunction are used as an animal model of autism. My impression is that few of the researchers using these mouse models appreciate that the claim of cerebellar abnormality in autism is controversial among those working with humans because each paper builds on the prior literature. There is entrenchment of error, for two reasons. First, many researchers will take at face value the summary of previous work in a peer-reviewed paper, without going back to original cited sources. Second, even if a researcher is careful and scholarly and does read the cited work, they are unlikely to find relevant studies that were not included in the literature review.

It is easy to take an example like this and bemoan the lack of rigour in scientific writing, but this is to discount cognitive biases that make it inevitable that, unless we adopt specific safeguards against this, cherry-picking of evidence will be the norm. Three biases lead us in this direction: confirmation bias, moral asymmetry, and reliance on schemata.

Confirmation bias: cherry-picking prior literature

A personal example may serve to illustrate the way confirmation bias can operate subconsciously. I am interested in genetic effects on children’s language problems, and I was in the habit of citing three relevant twin studies when I gave talks on this topic. All these obtained similar results, namely that there was a strong genetic component to developmental language disorders, as evidenced by much higher concordance for disorder in pairs of monozygotic versus dizygotic twins. In 2005 , however, Hayiou-Thomas, Oliver, and Plomin published a twin study with very different findings, with low twin/co-twin concordance, regardless of zygosity. It was only when I came to write a review of this area and I checked the literature that I realised I had failed to mention the 2005 study in talks for a year or two, even though I had collaborated with the authors and was well aware of the findings. I had formed a clear view on heritability of language disorders, and so I had difficulty remembering results that did not agree. Subsequently, I realised we should try to understand why this study obtained different results and found a plausible explanation ( Bishop & Hayiou-Thomas, 2008 ). But I only went back for a further critical look at the study because I needed to make sense of the conflicting results. It is inevitable that we behave this way as we try to find generalisable results from a body of work, but it creates an asymmetry of attention and focus between work that we readily accept, because it fits, and work that is either forgotten or looked at more critically, because it does not.

A particularly rich analysis of citation bias comes from a case study by Greenberg (2009) , who took as his starting point papers concerned with claims that a protein, β amyloid, was involved in causing a specific form of muscle disease. Greenberg classified papers according to whether they were positive, negative, or neutral about this claim and carried out a network analysis to identify influential papers (those with many citations). He found that papers that were critical of the claim received far fewer citations than those that supported it, and this was not explained by lower quality. Animal model studies were almost exclusively justified by selective citation of positive studies. Consistent with the idea of “reconstructive remembering,” he also found instances where cited content was distorted, as well as cases where influential review papers amplified citation bias by focusing attention only on positive work. The net result was an information (perhaps better termed a disinformation) cascade that would lead to a lack of awareness of critical data, which never gets recognised. In effect, when we have agents that adopt Bayesian reasoning, if they are presented with distorted information, this creates a positive feedback loop that leads to increasing bias in the prior. Viewed this way, we can start to see how omission of relevant citations is not a minor peccadillo but a serious contributor to entrenchment of error. Further evidence of the cumulative impact of citation bias is shown in Figure 4 , which uses studies of intervention for depression. Because studies in this area are registered, it is possible to track the fate of unpublished as well as published studies. The researchers showed that studies with null results are far less likely to be published than those with positive findings, but even if the former are published, there is a bias against citing them.


Figure 4. The cumulative impact of reporting and citation biases on the evidence base for antidepressants. Panel (a) displays the initial, complete cohort of trials recorded in a registry, while panels (b) through (e) show the cumulative effect of biases. Each circle indicates a trial, and its colour indicates whether the results were positive, were negative, or were reported so as to give a misleadingly positive impression (spin). Circles connected by a grey line indicate trials from the same publication. The progression from (a) to (b) shows that nearly all the positive trials, but only half of those with null results, were published; reporting of the null studies then showed (c) bias or (d) spin in what was reported. In (e), the size of each circle indicates the relative number of citations received by that category of studies.

Source. Reprinted with permission from De Vries et al. (2018) .

Having described such cases of citation bias, it is worth pausing to consider one of the best-known examples of distorted thinking: experimenter bias. This is similar to confirmation bias, but rather than involving selective attention to specific aspects of a situation that fit our preconceptions, it has a more active character, whereby the experimenter can unwittingly influence the outcome of a study. The best-known research on this topic was the original Rosenthal and Fode (1963) study, in which students were informed that the rats they were studying were “maze-bright” or “maze-dull,” when in fact they did not differ. Nevertheless, the “maze-bright” group learned better, suggesting that the experimenter would try harder to train an animal thought to have potential. A related study by Rosenthal and Jacobson (1963) claimed that if teachers were told that a test had revealed that specific pupils were “ready to bloom,” those pupils would do better on an IQ test administered at the end of the year, even though the children so designated had been selected at random.

Both these studies are widely cited. It is less well known that work on experimenter bias was subjected to a scathing critique by Barber and Silver (1968) , entitled “Fact, fiction and the experimenter bias effect,” in which it was noted that work in this area suffered from poor methodological quality, in particular p -hacking. Barber and Silver did not deny that experimenter bias could affect results, but they concluded that these effects were far less common and smaller in magnitude than those implied by Rosenthal’s early work. Subsequently, Barber (1976) developed this critique further in his book Pitfalls in Human Research. Yet Rosenthal’s work is more highly cited and better remembered than that of Barber.

Rosenthal’s work provides a cautionary tale: although confirmation bias helps explain distorted patterns of citation, the evidence for maladaptive cognitive biases has been exaggerated. Furthermore, studies on confirmation bias often use artificial experiments, divorced from real life, and the criteria for deciding that reasoning is erroneous are often poorly justified ( Hahn & Harris, 2014 ). In future, it would be worthwhile doing more naturalistic explorations of people’s memory for studies that do and do not support a position when summarising scientific literature.

On a related point, in using confirmation bias as an explanation for persistence of weak theories, there is a danger that I am falling into exactly the trap that I am describing. For instance, I was delighted to find Greenberg’s (2009) paper, as it chimed very well with my experiences when reading papers about cerebellar deficits in autism. But would I have described and cited it here if it had shown no difference between citations for papers that did and did not support the β amyloid claim? Almost certainly not. Am I going to read all literature on citation bias to find out how common it is? That strategy would soon become impossible if I tried to do it for every idea I touch upon in this article.

Moral asymmetry between errors of omission and commission

The second bias that fortifies the distortions in a literature review is the asymmetry of moral judgement that I referred to when discussing p -hacking. To my knowledge, paltering has not been studied in the context of literature reviews, but my impression is that selective presentation of results that fit, while failing to mention important contextual factors (e.g., the vermis in those with autism is smaller but only when you have covaried for the total cerebellar size), is common. How far this is deliberate or due to reconstructive remembering, however, is impossible to establish.

It would also be of interest to conduct studies on people’s attitudes to the acceptability of cherry-picking of literature versus paltering (misleadingly selective reporting) or invention of a study. I would anticipate that most would regard cherry-picking as fairly innocuous, for several reasons: first, it could be an unintended omission; second, the consequences of omitting material from a review may be seen as less severe than introducing misinformation; and third, selective citation of papers that fit a narrative can have a positive benefit in terms of readability. There are also pragmatic concerns: some journals limit the word count for an introduction or reference list so that full citation of all relevant work is not possible and, finally, sanctioning people for harmful omissions would create apparently unlimited obligations ( Haidt & Baron, 1996 ). Quite simply, there is far too much literature for even the most diligent scholar to read.

Nevertheless, consequences of omission can be severe. The above examples of research on the serotonin transporter gene in depression, or cerebellar abnormality in autism, emphasise how failure to cite earlier null results can lead to a misplaced sense of confidence in a phenomenon, which is wasteful in time and money when others attempt to build on it. And the more we encounter a claim, the more likely it is to be judged as true, regardless of actual accuracy (see Pennycook, Cannon, & Rand, 2018 , for a topical example). As Ingelfinger (1976) put it, “faulty or inappropriate references . . . like weeds, tend to reproduce themselves and so permit even the weakest of allegations to acquire, with repeated citation, the guise of factuality” (p. 1076).

Reliance on schemata

Our brains cannot conceivably process all the information around us: we have to find a way to select what is important to function and survive. This involves a search for meaningful patterns, which once established, allow us to focus on what is relevant and ignore the rest. Scientific discovery may be seen as an elevated version of pattern discovery: we see the height of scientific achievement as discovering regularities in nature that allow us to make better predictions about how the world behaves and to create new technologies and interventions from the basic principles we have discovered.

Scientific progress is not a simple process of weighing up competing pieces of evidence in relation to a theory. Rather than simply choosing between one hypothesis and another, we try to understand a problem in terms of a schema. Bartlett (1932) was one of the first psychologists to study how our preconceptions, or schemata, create distortions in perception and memory. He introduced the idea of “reconstructive remembering,” demonstrating how people’s memory of a narrative changed over time in specific ways, to become more coherent and aligned with pre-existing schemata.

Bartlett’s (1932) work on reconstructive remembering can explain why we not only tend to ignore inconsistent evidence ( Duyx, Urlings, Swaen, Bouter, & Zeegers, 2017 ) but also are prone to distort the evidence that we do include ( Vicente & Brewer, 1993 ). If we put together the combined influence of confirmation bias and reconstructive remembering, it suggests that narrative literature reviews have a high probability of being inaccurate: both types of bias will lead to a picture of research converging on a compelling story, when the reality may be far less tidy ( Katz, 2013 ).

I have focused so far on bias in citing prior literature, but schemata also influence how researchers go about writing up results. If we were just to present a set of facts that did not cohere, our work would be difficult to understand and remember. As Chalmers, Hedges, and Cooper (2002) noted, this point was made in 1885 by Lord Rayleigh in a presidential address to the British Association for the Advancement of Science:

If, as is sometimes supposed, science consisted in nothing but the laborious accumulation of facts, it would soon come to a standstill, crushed, as it were, under its own weight. The suggestion of a new idea, or the detection of a law, supersedes much that has previously been a burden on the memory, and by introducing order and coherence facilitates the retention of the remainder in an available form. ( Rayleigh, 1885 , p. 20)

Indeed, when we write up our research, we are exhorted to “tell a story,” which achieves the “order and coherence” that Rayleigh described. Since his time, ample literature on narrative comprehension has confirmed that people fill in gaps in unstated information and find texts easier to comprehend and memorise when they fit a familiar narrative structure ( Bower & Morrow, 1990 ; Van den Broek, 1994 ).

This resonates with Dawkins’ ( 1976 ) criteria for a meme, i.e., an idea that persists by being transmitted from person to person. Memes need to be easy to remember, understand, and communicate, and so narrative accounts make far better memes than dry lists of facts. From this perspective, narrative serves a useful function in providing a scaffolding that facilitates communication. However, while this is generally a useful, and indeed essential, aspect of human cognition, in scientific communication, it can lead to propagation of false information. Bartlett (1932) noted that remembering is hardly ever really exact, “and it is not at all important that it should be so.” He was thinking of the beneficial aspects of schemata, in allowing us to avoid information overload and to focus on what is meaningful. However, as Dawkins emphasised, survival of a meme does not depend on it being useful or true. An idea such as the claim that vaccination causes autism is a very effective meme, but it has led to resurgence of diseases that were close to being eradicated.

In communicating scientific results, we need to strike a fine balance between presenting a precis of findings that is easily communicated and moving towards an increase in knowledge. I would argue the pendulum may have swung too far in the direction of encouraging researchers to tell good narratives. Not just media outlets, but also many journal editors and reviewers, encourage authors to tell simple stories that are easy to understand, and those who can produce these may be rewarded with funding and promotion.

The clearest illustration of narrative supplanting accurate reporting comes from the widespread use of HARKing, which was encouraged by Bem (2004) when he wrote,

There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).

Of course, formulating a hypothesis on the basis of observed data is a key part of the scientific process. However, as noted above, it is not acceptable to use the same data both to formulate and to test the hypothesis—replication in a new sample is needed to avoid being misled by the play of chance and littering the literature with false positives (Lazic, 2016; Wagenmakers et al., 2012).
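
To make the cost of such double use of data concrete, here is a minimal simulation sketch (in Python, assuming NumPy and SciPy are available; the sample size and number of candidate predictors are my own illustrative choices, not values taken from any study cited here). It repeatedly generates pure-noise data, declares whichever of many candidate predictors looks most “significant” to be the finding, and then retests that selected predictor in a fresh sample.

```python
# Illustrative sketch (not from the cited papers): the same noise-only data are
# used both to select a "best" predictor and to declare it significant, and the
# selected predictor is then retested in an independent sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs, n_predictors, n_runs = 30, 20, 2000
harked_hits = replication_hits = 0

for _ in range(n_runs):
    y = rng.normal(size=n_obs)                   # outcome: pure noise
    X = rng.normal(size=(n_obs, n_predictors))   # candidate predictors: pure noise
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)]
    if min(pvals) < 0.05:                        # best-looking result, same data
        harked_hits += 1
        # Honest follow-up: retest the selected (null) predictor in a new sample
        if stats.pearsonr(rng.normal(size=n_obs), rng.normal(size=n_obs))[1] < 0.05:
            replication_hits += 1

print(f"Runs where the best of {n_predictors} tests was 'significant': {harked_hits / n_runs:.0%}")
print(f"...of which replicated in a fresh sample: {replication_hits / max(harked_hits, 1):.0%}")
```

Because no effect exists, the retest succeeds at roughly the nominal 5% rate, whereas selecting the best of many tests on the original data produces apparent “findings” far more often.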

Kerr (1998) considered why HARKing is a successful strategy and pointed out that it allows the researcher to construct an account of an experiment that fits a good story script:

Positing a theory serves as an effective “initiating event.” It gives certain events significance and justifies the investigators’ subsequent purposeful activities directed at the goal of testing the hypotheses. And, when one HARKs, a “happy ending” (i.e., confirmation) is guaranteed. (p. 203)

In this regard, Bem’s advice makes perfect sense: “A journal article tells a straightforward tale of a circumscribed problem in search of a solution. It is not a novel with subplots, flashbacks, and literary allusions, but a short story with a single linear narrative line.”

We have, then, a serious tension in scientific writing. We are expected to be scholarly and honest, to report all our data and analyses and not to hide inconvenient truths. At the same time, if we want people to understand and remember our work, we should tell a coherent story from which unnecessary details have been expunged and where we cut out any part of the narrative that distracts from the main conclusions.

Kerr (1998) was clear that HARKing has serious costs. As well as translating type I errors into hard-to-eradicate theory, he noted that it presents a distorted view of science as a process which is far less difficult and unpredictable than is really the case. We never learn what did not work because inconvenient results are suppressed. For early career researchers, it can lead to cynicism when they learn that the rosy picture portrayed in the literature was achieved only by misrepresentation.

Overcoming cognitive constraints to do better science

One thing that is clear from this overview is that we have known about cognitive constraints for decades, yet they continue to affect scientific research. Finding ways to mitigate the impact of these constraints should be a high priority for experimental psychologists. Here, I draw together some general approaches that might be used to devise an agenda for research improvement. Many of these ideas have been suggested before, but without much consideration of the cognitive constraints that may affect their implementation. Some methods, such as training, attempt to overcome the constraints directly in individuals; others involve making structural changes to how science is done to counteract our human tendency towards unscientific thinking. None of these provides a total solution; rather, the goal is to tweak the dials that dictate how people think and behave, to move us closer to better scientific practices.

It is often suggested that better training is needed to improve the replicability of scientific results, yet the focus tends to be on formal instruction in experimental design and statistics. Less attention has been given to engendering a more intuitive understanding of probability, or to counteracting cognitive biases, though there are exceptions, such as the course by Steel, Liermann, and Guttorp (2018), which starts with a consideration of “How the wiring of the human brain leads to incorrect conclusions from data.” One way of inducing a more intuitive sense of statistics and p-values is by using data simulations. Simulation is not routinely incorporated in statistics training, but free statistical software now puts this within the grasp of all (Tintle et al., 2015). This is a powerful way to experience how easy it is to get a “significant” p-value when running multiple tests. Students are often surprised when they generate repeated runs of a correlation matrix of random numbers with, say, five variables and find at least one “significant” correlation in about one in four runs. Once you understand the difference between the probability of getting a specific result on a single test, predicted in advance, and the probability of that result coming up at least once in a multitude of tests, the dangers of p-hacking become easier to grasp.
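
As a concrete version of the classroom exercise just described, the following sketch (again Python with NumPy and SciPy; the sample size of 20 observations per run is an arbitrary choice of mine) generates repeated correlation matrices of random numbers with five variables and reports how often at least one of the ten pairwise correlations is nominally “significant” at p < .05.

```python
# Classroom-style simulation: correlate five columns of random numbers, many
# times over, and count how often at least one of the 10 pairwise correlations
# has p < .05. Number of observations and runs are illustrative choices.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_runs, n_obs, n_vars = 1000, 20, 5
runs_with_a_hit = 0

for _ in range(n_runs):
    data = rng.normal(size=(n_obs, n_vars))      # no real effects anywhere
    pvals = [stats.pearsonr(data[:, i], data[:, j])[1]
             for i, j in itertools.combinations(range(n_vars), 2)]
    if min(pvals) < 0.05:
        runs_with_a_hit += 1

print(f"Runs with at least one nominally 'significant' correlation: {runs_with_a_hit / n_runs:.0%}")
```

Increasing n_vars makes the familywise problem immediately visible: more variables mean more pairwise tests, and more runs containing at least one spurious “hit.”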

Data simulation could also help overcome the misplaced “belief in the law of small numbers” (Tversky & Kahneman, 1974). By generating datasets with a known effect size, and then drawing samples from them and subjecting each to a statistical test, the student can learn to appreciate just how easy it is to miss a true effect (a type II error) if the study is underpowered.
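
A minimal sketch of this second exercise, under values chosen purely for illustration (a true population correlation of .3 and samples of 25 observations), might look like this:

```python
# Power/type II exercise: draw small samples from a population with a KNOWN
# true correlation and count how often a test on the sample misses the effect.
# The true correlation (.3) and sample size (25) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_r, n_obs, n_runs = 0.3, 25, 2000
misses = 0

for _ in range(n_runs):
    x = rng.normal(size=n_obs)
    # Construct y so that its population correlation with x is true_r
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n_obs)
    if stats.pearsonr(x, y)[1] >= 0.05:          # the real effect goes undetected
        misses += 1

print(f"Type II errors (true effect missed) with n = {n_obs}: {misses / n_runs:.0%}")
```

Rerunning the sketch with a larger sample size shows the miss rate falling as statistical power increases.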

There is a small literature evaluating attempts to inoculate people against specific types of cognitive bias. For instance, Morewedge et al. (2015) developed instructional videos and computer games designed to reduce a series of cognitive biases, including confirmation bias, and found these to be effective over the longer term. Typically, however, such interventions focus on hypothetical scenarios outside the scope of experimental psychology. They might improve the scientific quality of research projects if adapted to make them relevant to conducting and appraising experiments.

Triangulation of methods in study design

I noted above that for science to progress, we need to overcome a tendency to settle on the first theory that seems “good enough” to account for observations. Any method that forces the researcher to actively search for alternative explanations is, therefore, likely to stimulate better research.

The notion of triangulation (Munafò & Davey Smith, 2018) was developed in the field of epidemiology, where reliance is primarily on observational data and experimental manipulation is not feasible. Inferring causality from correlational data is hazardous, but it is possible to adopt a strategy of combining complementary analytic approaches, each of which has different assumptions, strengths, and weaknesses. Epidemiology progresses when different explanations for correlational data are explicitly identified and evaluated, and converging evidence is obtained (Lawlor, Tilling, & Davey Smith, 2016). This approach could be extended to other disciplines by explicitly requiring researchers to use at least two different methods, with different potential biases, when evaluating a specific hypothesis.

A “culture of criticism”

Smith (2006) described peer review as “a flawed process, full of easily identified defects with little evidence that it works” (p. 182). Yet peer review provides one way of forcing researchers to recognise when they are so focused on a favoured theory that they are unable to break away. Hossenfelder (2018) has argued that the field of particle physics has stagnated because of a reluctance to abandon theories that are deemed “beautiful.” We are accustomed to regarding physicists as superior to psychologists in terms of theoretical and methodological sophistication. In general, they place far less emphasis than we do on statistical criteria for evidence, and where they do use statistics, they understand probability theory and adopt very stringent levels of significance. Nevertheless, according to Hossenfelder, they are subject to cognitive and social biases that make them reluctant to discard theories. She concludes her book with an Appendix on “What you can do to help,” and as well as advocating better understanding of cognitive biases, she recommends some cultural changes to address these. These include building “a culture of criticism.” In principle, we already have this—talks and seminars should provide a forum for research to be challenged—but in practice, critiquing another’s work is often seen as clashing with social conventions of being supportive to others, especially when it is conducted in public.

Recently, two other approaches have been developed with the potential to make a “culture of criticism” more useful and more socially acceptable. Registered Reports (Chambers, 2019) is an approach that was devised to prevent publication bias, p-hacking, and HARKing. This format moves the peer review process to a point before data collection, so that results cannot influence editorial decisions. An unexpected positive consequence is that peer review comes at a point when it can be acted upon to improve the experimental design. Where reviewers of Registered Reports ask “how could we disprove the hypothesis?” and “what other explanations should we consider?”, this can generate more informative experiments.

A related idea is borrowed from business practices and is known as the “pre-mortem” approach (Klein, 2007). Project developers gather together and are asked to imagine that a proposed project has gone ahead and failed. They are then encouraged to write down reasons why this has happened, allowing people to voice misgivings that they may have been reluctant to state openly, so that potential problems can be addressed before the project begins. It would be worth evaluating the effectiveness of pre-mortems for scientific projects. We could strengthen this approach by incorporating ideas from Bang and Frith (2017), who noted that group decision-making is most likely to be effective when the group is diverse and people can express their views anonymously. With both Registered Reports and the study pre-mortem, reviewers can act as critical friends who encourage researchers to identify ways to improve a project before it is conducted. This can be a more positive experience for the reviewer, who may otherwise have no option but to recommend rejection of a study with flawed methodology.

Counteracting cherry-picking of literature

Turning to cherry-picking of prior literature: the established solution is the systematic review, in which clear criteria are laid out in advance so that a comprehensive search can be made for all relevant studies (Siddaway, Wood, & Hedges, 2019). The systematic review is only as good as the data that go into it, however, and if a field suffers from substantial publication bias and/or p-hacking, then, rather than tackling error entrenchment, it may add to it. Even with the most scrupulous search strategy, relevant papers with null results can be missed, because positive results are mentioned in the titles and abstracts of papers whereas null results are not (Lazic, 2016, p. 15). This means that, if a study looks at many possible associations (e.g., with brain regions or with genes), studies that considered a specific association but failed to find support for it will be systematically disregarded. This may explain why it seems to take 30 or 40 years for some erroneous entrenched theories to be abandoned. The situation may improve with the increasing availability of open data: provided data are adequately documented and accessible, the problem of missing relevant studies may be reduced.
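
The way in which selective visibility of “positive” results can entrench error rather than correct it can itself be shown by simulation. The sketch below (Python; the true effect size, group sizes, and number of studies are illustrative assumptions of mine) compares the average effect estimate across all simulated studies with the average across only those studies that reached p < .05, as if null results had never surfaced in the literature search.

```python
# Sketch: if a review can only see studies that reached p < .05, the pooled
# estimate is inflated relative to the true effect. Parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n_per_group, n_studies = 0.2, 20, 5000   # small true effect, small studies
all_estimates, visible_estimates = [], []

for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    d_hat = treatment.mean() - control.mean()    # effect estimate (population SD = 1)
    all_estimates.append(d_hat)
    if stats.ttest_ind(treatment, control)[1] < 0.05:
        visible_estimates.append(d_hat)          # only 'positive' studies are found

print(f"True effect:                    {true_d}")
print(f"Mean estimate, all studies:     {np.mean(all_estimates):.2f}")
print(f"Mean estimate, visible studies: {np.mean(visible_estimates):.2f}")
```

Under these assumptions, the studies that clear the significance threshold are precisely those with the most exaggerated estimates, so a review restricted to them overstates the true effect.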

Ultimately, the problem of biased reviews may not be soluble just by changing people’s citation habits. Journal editors and reviewers could insist that abstracts follow a structured format and report all variables that were tested, not just those that gave significant results. A more radical approach by funders may be needed to disrupt this wasteful cycle. When a research team applies to test a new idea, they could first be required to (a) conduct a systematic review (unless one has recently been done) and (b) replicate the original findings on which the work is based: this is the opposite of what happens currently, where novelty and originality are major criteria for funding. In addition, it could be made mandatory for any newly funded research idea to be investigated by at least two independent laboratories, using at least two different approaches (triangulation). All these measures would drastically slow down science and may be unfeasible where research needs highly specialised equipment, facilities, or skills that are specific to one laboratory. Nevertheless, slower science may be preferable to the current system, in which there are so many examples of false leads being pursued for decades, with consequent waste of resources.

Reconciling storytelling with honesty

Perhaps the hardest problem is how to reconcile our need for narrative with a “warts and all” account of research. Consider this advice from Bem (2004)—which I suspect many journal editors would endorse:

Contrary to the conventional wisdom, science does not care how clever or clairvoyant you were at guessing your results ahead of time. Scientific integrity does not require you to lead your readers through all your wrongheaded hunches only to show—voila!—they were wrongheaded. A journal article should not be a personal history of your stillborn thoughts . . . Your overriding purpose is to tell the world what you have learned from your study. If your results suggest a compelling framework for their presentation, adopt it and make the most instructive findings your centerpiece . . . Think of your dataset as a jewel. Your task is to cut and polish it, to select the facets to highlight, and to craft the best setting for it.

As Kerr (1998) pointed out, HARKing gives a misleading impression of what was found, which can be particularly damaging for students, who, on reading the literature, may form the impression that it is normal for scientists to have their predictions confirmed and may think of themselves as incompetent when their own experiments do not work out that way. One of the goals of pre-registration is to ensure that researchers do not omit inconvenient facts when writing up a study—or, if they do, at least to make it possible to see that this has been done. In the field of clinical medicine, impressive progress has been made in methodology, with registration now a requirement for clinical trials (International Committee of Medical Journal Editors, 2019). Yet Goldacre et al. (2019) found that even when a trial was registered, it was common for researchers to change the primary outcome measure without explanation, and it has similarly been noted that pre-registrations in psychology are often too ambiguous to preclude p-hacking (Veldkamp et al., 2018). Registered Reports (Chambers, 2019) adopt stricter standards that should prevent HARKing, but the author may struggle to maintain a strong narrative, because messy reality makes a less compelling story than a set of results subjected to Bem’s (2004) cutting and polishing process.

Rewarding credible research practices

A final set of recommendations has to do with changing the culture so that incentives are aligned with efforts to counteract unhelpful cognitive constraints, and researchers are rewarded for doing reproducible, replicable research rather than for grant income or publications in high-impact journals (Forstmeier, Wagenmakers, & Parker, 2016; Pulverer, 2015). There is already evidence that funders are concerned to address problems with the credibility of biomedical research (Academy of Medical Sciences, 2015), and rigour and reproducibility are increasingly mentioned in grant guidelines (e.g., https://grants.nih.gov/policy/reproducibility/index.htm). One funder, Cancer Research UK, is innovating by incorporating Registered Reports into a two-stage funding model (Munafò, 2017). We now need publishers and institutions to follow suit and ensure that researchers are not disadvantaged by adopting a self-critical mind-set and engaging in practices of open and reproducible science (Poldrack, 2019).

Acknowledgments

My thanks to Kate Nation, Matt Jaquiery, Joe Chislett, Laura Fortunato, Uta Frith, Stefan Lewandowsky, and Karalyn Patterson for invaluable comments on an early draft of this manuscript.

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author is supported by a Principal Research Fellowship from the Wellcome Trust (programme grant no. 082498) and European Research Council advanced grant no. 694189.
