Experimenter Bias (Definition + Examples)


In the early 1900s, a German high school teacher named Wilhelm von Osten thought that the intelligence of animals was underrated. He decided to teach his horse, Hans, some basic arithmetic to prove his point. Clever Hans, as the horse came to be known, was learning quickly. Soon he could add, subtract, multiply, and divide, and he would give correct answers by tapping his hoof. It took scientists over a year to prove that the horse wasn’t doing the calculations himself. It turned out that Clever Hans was picking up subtle cues from his owner’s facial expressions and gestures.

Influencing the outcome of an experiment in this way is called "experimenter bias" or "observer-expectancy bias."

What is Experimenter Bias?

Experimenter bias occurs when a researcher either intentionally or unintentionally affects data, participants, or results in an experiment. 

The phenomenon is also known as observer bias, information bias, research bias, expectancy bias, experimenter effect, observer-expectancy effect, experimenter-expectancy effect, and observer effect. 


One of the leading causes of experimenter bias is the human inability to remain completely objective. Biases like confirmation bias and hindsight bias affect our judgment every day! In the case of experimenter bias, researchers may lean toward their original expectations about a hypothesis without being aware that they are making errors or treating participants differently. These expectations can influence how studies are structured, conducted, and interpreted, and they may distort the results, making them flawed or irrelevant. In this sense, experimenter bias is often a more specific case of confirmation bias.

Rosenthal and Fode Experiment

One of the best-known examples of experimenter bias is the experiment conducted by psychologists Robert Rosenthal and Kermit Fode in 1963. 

Rosenthal and Fode asked two groups of psychology students to assess the ability of rats to navigate a maze. While one group was told their rats were “bright”, the other was convinced they had been assigned “dull” rats. In reality, the rats were randomly chosen, and no significant difference existed between them.

Interestingly, the students who were told their rats were maze-bright reported faster running times than those who did not expect their rodents to perform well. In other words, the students’ expectations directly influenced the results they obtained.

Rosenthal and Fode’s experiment shows how the outcomes of a study can be modified as a consequence of the interaction between the experimenter and the subject. 

However, experimenter-subject interaction is not the only source of experimenter bias. (It's not the only time bias may appear as one person observes another's actions. We are influenced by the actor-observer bias daily, whether or not we work in a psychology lab!)

Types of Experimenter Bias

Experimenter bias can occur in all study phases, from the initial background research and survey design to data analysis and the final presentation of results. 

Design bias


Design bias is one of the most frequent types of experimenter bias. It happens when researchers establish a particular hypothesis and shape their entire methodology to confirm it. Rosenthal reported that roughly 70% of experimenter biases influence outcomes in favor of the researcher's hypothesis.

Example of Experimenter Bias (Design Bias)

An experimenter believes that separating men and women for long periods eventually makes them restless and hostile. It's a silly hypothesis, but it could be "proven" through design bias. Let's say a psychologist sets this idea as their hypothesis. They measure participants' stress levels before the experiment begins. During the experiment, the participants are separated by gender and isolated from the world. Their diets are changed. Routines are shifted. Participants don't have access to their friends or family. Surely, they are going to get restless. The psychologist could argue that these results prove their point. But do they?

Not all examples of design bias are this extreme, but this one shows how design choices can influence outcomes.

Sampling bias


Sampling or selection bias refers to choosing participants in such a way that certain demographics are underrepresented or overrepresented in a study. Studies affected by sampling bias are not based on a fully representative group.

Omission bias occurs when participants of certain ethnic or age groups are left out of the sample. Inclusive bias, by contrast, occurs when samples are selected for convenience, such as when all participants fit a narrow demographic range.

Example of Experimenter Bias (Sampling Bias)

Philip Zimbardo created the Stanford Prison Experiment to answer the question, "What happens when you put good people in an evil place?" It is now one of the most infamous experiments in social psychology. But there is (at least) one problem with Zimbardo's attempt to answer such a broad question: he did not put all types of "good people" in an evil place. All the participants in the Stanford Prison Experiment were young men. Can 24 young men of the same age and background reflect the mindsets of all "good people?" Not really.

Procedural bias


Procedural bias arises when the way a study is carried out affects the results. For example, if participants are given only a short time to answer questions, their responses will be rushed and may not accurately reflect their opinions or knowledge.

Example of Experimenter Bias (Procedural Bias)

Once again, the Stanford Prison Experiment offers a good example, though this one rests on an accusation rather than a confirmed fact. Years after the experiment made headlines, Zimbardo was accused of "coaching" the guards, allegedly encouraging them to act aggressively toward the prisoners. If this is true, then the findings regarding the guards' aggression reflect not the premise of the experiment but the procedure. What happens when you put good people in an evil place and coach them to be evil?

Measurement bias


Measurement bias is a systematic error during the data collection phase of research. It can take place when the equipment used is faulty or when it is not being used correctly. 

Example of Experimenter Bias (Measurement Bias)

Failing to calibrate scales can drastically change the results of a study! Another example is systematically rounding measurements up or down. If an experimenter is not exact with their measurements, they can skew the results. Bias does not have to be nefarious; it can simply be neglectful.
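To make the rounding point concrete, here is a tiny illustrative sketch in Python (the readings and the always-round-down rule are invented for the example, not taken from any real study). It shows how truncating every measurement in the same direction shifts the average the same way every time, which is the hallmark of a systematic error rather than random noise.

```python
# Hypothetical weight readings in kilograms (illustrative values only).
readings = [70.6, 82.4, 65.9, 91.7, 58.3, 77.8]

true_mean = sum(readings) / len(readings)

# Careless protocol: always round down to the nearest whole kilogram.
rounded_down = [int(r) for r in readings]
biased_mean = sum(rounded_down) / len(rounded_down)

print(f"True mean:   {true_mean:.2f} kg")    # 74.45 kg
print(f"Biased mean: {biased_mean:.2f} kg")  # 73.83 kg, consistently lower
```

The error in each reading is small, but because it always points the same way, it accumulates into a biased result rather than washing out like random noise would.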

Interviewer bias


Interviewers can consciously or subconsciously influence responses by providing additional information and subtle cues. As we saw in the rat-maze experiment, the subject's responses can easily drift toward the interviewer's expectations.

Example of Experimenter Bias (Interviewer Bias)

Think about the difference between the following sets of questions:

  • "How often do you bathe?" vs. "I'm sure you're very hygienic, right?"
  • "On a scale from 1-10, how much pain did you experience?" vs. "Was the pain mild, moderate, or excruciating?"
  • "Who influenced you to become kind?" vs. "Did your mother teach you to use manners?"

The differences between these questions are subtle. In some contexts, researchers may not even consider them biased! If you are creating questions for an interview, be sure to consult a diverse group of researchers. Interviewer bias can stem from our upbringing, media consumption, and other factors we cannot fully control!

Response bias


Response bias is a tendency to answer questions inaccurately. Participants may want to provide the answers they think are correct, for instance, or answers that are more socially acceptable than what they truly believe. Respondents are often subject to the Hawthorne effect, a phenomenon in which people put in more effort and perform better in a study because they know they are being observed.

Example of Experimenter Bias (Response Bias)

The Asch Line Study is a great example of this bias. Of course, researchers created that study specifically to demonstrate it. In the study, each participant sat among several "actors." The researcher asked the room to match a target line to one of several comparison lines. Every actor in the room answered incorrectly, and to conform, many participants went along with the wrong answer. This is response bias, and it happens more often than you might think.

Reporting bias


Reporting bias, also called selective reporting, arises when the nature of the results influences the dissemination of research findings. This type of bias is usually out of the researcher’s control. Even though studies with negative results can be just as significant as positive ones, the latter are much more likely to be reported, published, and cited by others. 

Example of Experimenter Bias (Reporting Bias)

Why do we hear about the Stanford Prison Experiment more than other experiments? Reporting bias! The Stanford Prison Experiment is fascinating. The drama surrounding the results makes great headlines. Stanford is a prestigious school. There is even a movie about it! Yes, biases affected the study, but psychologists and content creators will continue discussing it for many years.

How Can You Remove Experimenter Bias From Research?

Unfortunately, experimenter bias cannot be wholly stamped out as long as humans are involved in the experimental process. Our upbringing, education, and experience may always color how we gather and analyze data. However, experimenter bias can be controlled, starting with making everyone involved in conducting experiments aware of the phenomenon.

How Can Experimenter Bias Be Controlled? 

One way to control experimenter bias is to intentionally put together a diverse team and encourage open communication about how to conduct experiments. The larger the group, the more perspectives will be shared and the more likely biases are to be revealed. Biases should be considered at every step of the process.

Strategies to Avoid Experimenter Bias

Most modern experiments are designed to reduce the possibility of bias-distorted results. In general, biases can be kept to a minimum if experimenters are properly trained and clear rules and procedures are implemented. 

There are several concrete ways in which researchers can avoid experimenter bias.

Blind analysis

A blind analysis is an effective way of reducing experimenter bias in many research fields. Any information that may influence the outcome of the experiment is withheld. Researchers are sometimes not shown the true, unmasked results until they have completed the analysis. Similarly, when participants are unaware of the hypothesis, they cannot steer the experiment's outcome.
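As a rough sketch of how blinding the analyst might look in practice (the data, group names, and workflow below are invented for illustration), the true group labels can be replaced with neutral codes, with the key withheld until the analysis is finished:

```python
import random

# Invented example data: outcome scores keyed by their true group label.
data = {
    "treatment": [8.1, 7.4, 9.0, 6.8],
    "control":   [6.2, 7.0, 5.9, 6.5],
}

# A colleague who is not doing the analysis assigns neutral codes at random
# and keeps the key to themselves until the analysis is finished.
codes = dict(zip(data, random.sample(["group_A", "group_B"], 2)))
masked = {codes[label]: scores for label, scores in data.items()}

# The analyst only ever sees "group_A" and "group_B", so expectations about
# the treatment cannot steer how the data are handled.
for code in sorted(masked):
    scores = masked[code]
    print(code, round(sum(scores) / len(scores), 2))

# Unblinding: the key is revealed only after the analysis is locked in.
print("Key:", codes)
```

The point of the design is that every analytic decision is made before anyone knows which code corresponds to the treatment group.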

Double-blind study


Double-blind techniques are commonly used in clinical research. In contrast to an open trial, a double-blind study is conducted so that neither the clinicians nor the patients know who is receiving the actual treatment and who is given a placebo, which greatly reduces design and interviewer biases in the experiment.

Minimizing exposure 

The less exposure respondents have to experimenters, the less likely they are to pick up any cues that could influence their answers. One of the most common ways to minimize interaction between participants and experimenters is to pre-record the instructions.

Peer review

Peer review involves having the work assessed by individuals with expertise comparable to the researcher's. Their role is to identify potential biases and thus make sure that the study is reliable and worthy of publication.

Understanding and addressing experimenter bias is crucial in psychological research and beyond. It reminds us that human perception and interpretation can significantly shape outcomes, whether it's Clever Hans responding to his owner's cues or students' expectations influencing their rats' performances.

Researchers can strive for more accurate, reliable, and meaningful results by acknowledging and actively working to minimize these biases. This awareness enhances the integrity of scientific research. It deepens our understanding of the complex interplay between observer and subject, ultimately leading to more profound insights into the human mind and behavior.

Types of Bias in Research | Definition & Examples

Research bias results from any deviation from the truth, causing distorted results and wrong conclusions. Bias can occur at any phase of your research, including during data collection, data analysis, interpretation, or publication. Research bias can occur in both qualitative and quantitative research.

Understanding research bias is important for several reasons.

  • Bias exists in all research, across research designs, and is difficult to eliminate.
  • Bias can occur at any stage of the research process.
  • Bias impacts the validity and reliability of your findings, leading to misinterpretation of data.

It is almost impossible to conduct a study without some degree of research bias. It’s crucial for you to be aware of the potential types of bias, so you can minimize them.

For example, consider a study of a weight-loss program. The program's apparent success rate will likely be affected if participants start to drop out (attrition). Participants who become disillusioned due to not losing weight may drop out, while those who succeed in losing weight are more likely to continue. This in turn may bias the findings towards more favorable results.

Table of contents

  • Information bias
  • Interviewer bias
  • Publication bias
  • Researcher bias
  • Response bias
  • Selection bias
  • Cognitive bias
  • How to avoid bias in research
  • Other types of research bias
  • Frequently asked questions about research bias

Information bias, also called measurement bias, arises when key study variables are inaccurately measured or classified. Information bias occurs during the data collection step and is common in research studies that involve self-reporting and retrospective data collection. It can also result from poor interviewing techniques or differing levels of recall from participants.

The main types of information bias are:

  • Recall bias
  • Observer bias
  • Performance bias
  • Regression to the mean (RTM)

For example, suppose you are studying whether heavy smartphone use is related to physical symptoms. Over a period of four weeks, you ask students to keep a journal, noting how much time they spent on their smartphones along with any symptoms like muscle twitches, aches, or fatigue.

Recall bias is a type of information bias. It occurs when respondents are asked to recall events in the past and is common in studies that involve self-reporting.

As a rule of thumb, infrequent events (e.g., buying a house or a car) will be memorable for longer periods of time than routine events (e.g., daily use of public transportation). You can reduce recall bias by running a pilot survey and carefully testing recall periods. If possible, test both shorter and longer periods, checking for differences in recall.

For example, suppose you are investigating whether children's early diets are linked to a later cancer diagnosis. You ask parents to recall their children's diets in two groups:

  • A group of children who have been diagnosed, called the case group
  • A group of children who have not been diagnosed, called the control group

Since the parents are being asked to recall what their children generally ate over a period of several years, there is high potential for recall bias in the case group.

The best way to reduce recall bias is by ensuring your control group will have similar levels of recall bias to your case group. Parents of children who have childhood cancer, which is a serious health problem, are likely to be quite concerned about what may have contributed to the cancer.

Thus, if asked by researchers, these parents are likely to think very hard about what their child ate or did not eat in their first years of life. Parents of children with other serious health problems (aside from cancer) are also likely to be quite concerned about any diet-related question that researchers ask about.

Observer bias is the tendency of researchers to see what they expect or want to see, rather than what is actually occurring. Observer bias can affect the results in observational and experimental studies, where subjective judgment (such as assessing a medical image) or measurement (such as rounding blood pressure readings up or down) is part of the data collection process.

Observer bias leads to over- or underestimation of true values, which in turn compromises the validity of your findings. You can reduce observer bias by using double-blinded and single-blinded research methods.

For example, suppose you and a colleague are separately observing how medical staff in a hospital ward share information about patients. Based on discussions you had with other researchers before starting your observations, you are inclined to think that medical staff tend to simply call each other when they need specific patient details or have questions about treatments.

At the end of the observation period, you compare notes with your colleague. Your conclusion was that medical staff tend to favor phone calls when seeking information, while your colleague noted down that medical staff mostly rely on face-to-face discussions. Seeing that your expectations may have influenced your observations, you and your colleague decide to conduct semi-structured interviews with medical staff to clarify the observed events.

Note: Observer bias and actor–observer bias are not the same thing.

Performance bias is unequal care between study groups. Performance bias occurs mainly in medical research experiments, if participants have knowledge of the planned intervention, therapy, or drug trial before it begins.

Studies about nutrition, exercise outcomes, or surgical interventions are very susceptible to this type of bias. It can be minimized by using blinding, which prevents participants and/or researchers from knowing who is in the control or treatment groups. If blinding is not possible, then using objective outcomes (such as hospital admission data) is the best approach.

When the subjects of an experimental study change or improve their behavior because they are aware they are being studied, this is called the Hawthorne effect (or observer effect). Similarly, the John Henry effect occurs when members of a control group are aware they are being compared to the experimental group. This causes them to alter their behavior in an effort to compensate for their perceived disadvantage.

Regression to the mean (RTM) is a statistical phenomenon that refers to the fact that a variable that shows an extreme value on its first measurement will tend to be closer to the center of its distribution on a second measurement.

Medical research is particularly sensitive to RTM. Here, interventions aimed at a group or a characteristic that is very different from the average (e.g., people with high blood pressure) will appear to be successful because of the regression to the mean. This can lead researchers to misinterpret results, describing a specific intervention as causal when the change in the extreme groups would have happened anyway.

For example, suppose you are evaluating an intervention for people with depression. In general, among people with depression, certain physical and mental characteristics have been observed to deviate from the population mean.

This could lead you to think that the intervention was effective when those treated showed improvement on measured post-treatment indicators, such as reduced severity of depressive episodes.

However, given that such characteristics deviate more from the population mean in people with depression than in people without depression, this improvement could be attributed to RTM.
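A short simulation (entirely synthetic data, written only to illustrate the statistics) makes this drift visible: when a group is selected because its first measurement was extreme, its second measurement moves back toward the population mean even though nothing about the individuals changed.

```python
import random

random.seed(1)
n = 10_000

# Each person has a stable true score; each measurement adds independent noise.
true_scores = [random.gauss(100, 10) for _ in range(n)]
first = [t + random.gauss(0, 10) for t in true_scores]
second = [t + random.gauss(0, 10) for t in true_scores]

# Select the "extreme" group: the top 5% on the first measurement.
cutoff = sorted(first)[int(0.95 * n)]
extreme = [i for i in range(n) if first[i] >= cutoff]

mean_first = sum(first[i] for i in extreme) / len(extreme)
mean_second = sum(second[i] for i in extreme) / len(extreme)

print(f"Extreme group, first measurement:  {mean_first:.1f}")
print(f"Extreme group, second measurement: {mean_second:.1f}")  # closer to 100
```

No intervention occurs between the two measurements, yet the selected group improves on paper; a study that enrolls only extreme scorers can easily mistake this drift for a treatment effect.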

Interviewer bias stems from the person conducting the research study. It can result from the way they ask questions or react to responses, but also from any aspect of their identity, such as their sex, ethnicity, social class, or perceived attractiveness.

Interviewer bias distorts responses, especially when the characteristics relate in some way to the research topic. Interviewer bias can also affect the interviewer’s ability to establish rapport with the interviewees, causing them to feel less comfortable giving their honest opinions about sensitive or personal topics.

For example, suppose you ask a participant how they spend their free time:

Participant: “I like to solve puzzles, or sometimes do some gardening.”

You: “I love gardening, too!”

In this case, seeing your enthusiastic reaction could lead the participant to talk more about gardening.

Establishing trust between you and your interviewees is crucial in order to ensure that they feel comfortable opening up and revealing their true thoughts and feelings. At the same time, being overly empathetic can influence the responses of your interviewees, as seen above.

Publication bias occurs when the decision to publish research findings is based on their nature or the direction of their results. Studies reporting results that are perceived as positive, statistically significant , or favoring the study hypotheses are more likely to be published due to publication bias.

Publication bias is related to data dredging (also called p-hacking), where statistical tests on a set of data are run until something statistically significant happens. As academic journals tend to prefer publishing statistically significant results, this can pressure researchers to only submit statistically significant results. P-hacking can also involve excluding participants or stopping data collection once a p value of 0.05 is reached. However, this leads to false positive results and an overrepresentation of positive results in published academic literature.
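A quick simulation suggests why this matters (the data are synthetic, and the sketch assumes SciPy's standard two-sample t-test is available). It mimics the "check the p value after every batch of participants and stop at the first significant result" strategy when there is no real effect at all:

```python
import random
from scipy.stats import ttest_ind  # standard two-sample t-test

random.seed(0)

def optional_stopping_trial(max_n=100, step=10, alpha=0.05):
    """Add participants in batches and stop as soon as p < alpha."""
    a, b = [], []
    for _ in range(max_n // step):
        # No true difference: both groups are drawn from the same distribution.
        a += [random.gauss(0, 1) for _ in range(step)]
        b += [random.gauss(0, 1) for _ in range(step)]
        if ttest_ind(a, b).pvalue < alpha:
            return True  # a "significant" result reached by peeking early
    return False

trials = 2000
false_positives = sum(optional_stopping_trial() for _ in range(trials))
print(f"False-positive rate with optional stopping: {false_positives / trials:.1%}")
```

Even though both groups come from the same distribution in every trial, peeking repeatedly pushes the false-positive rate well above the nominal 5%, which is exactly the overrepresentation of positive results described above.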

Researcher bias occurs when the researcher’s beliefs or expectations influence the research design or data collection process. Researcher bias can be deliberate (such as claiming that an intervention worked even if it didn’t) or unconscious (such as letting personal feelings, stereotypes, or assumptions influence research questions ).

The unconscious form of researcher bias is associated with the Pygmalion effect (or Rosenthal effect ), where the researcher’s high expectations (e.g., that patients assigned to a treatment group will succeed) lead to better performance and better outcomes.

Researcher bias is also sometimes called experimenter bias, but it applies to all types of investigative projects, rather than only to experimental designs .

For example, the phrasing of an interview question can reveal the researcher's own assumptions:

  • Good question: What are your views on alcohol consumption among your peers?
  • Bad question: Do you think it’s okay for young people to drink so much?

Response bias is a general term used to describe a number of different situations where respondents tend to provide inaccurate or false answers to self-report questions, such as those asked on surveys or in structured interviews .

This happens because when people are asked a question (e.g., during an interview ), they integrate multiple sources of information to generate their responses. Because of that, any aspect of a research study may potentially bias a respondent. Examples include the phrasing of questions in surveys, how participants perceive the researcher, or the desire of the participant to please the researcher and to provide socially desirable responses.

Response bias also occurs in experimental medical research. When outcomes are based on patients’ reports, a placebo effect can occur. Here, patients report an improvement despite having received a placebo, not an active medical treatment.

While interviewing a student, you ask them:

“Do you think it’s okay to cheat on an exam?”

Because cheating is widely disapproved of, the student is likely to deny it regardless of their actual behavior, so the answer tells you little about what they really do or think.

Common types of response bias are:

  • Acquiescence bias
  • Demand characteristics
  • Social desirability bias
  • Courtesy bias
  • Question-order bias
  • Extreme responding

Acquiescence bias is the tendency of respondents to agree with a statement when faced with binary response options like “agree/disagree,” “yes/no,” or “true/false.” Acquiescence is sometimes referred to as “yea-saying.”

This type of bias occurs either due to the participant’s personality (i.e., some people are more likely to agree with statements than disagree, regardless of their content) or because participants perceive the researcher as an expert and are more inclined to agree with the statements presented to them.

Q: Are you a social person?

  • Yes
  • No

People who are inclined to agree with statements presented to them are at risk of selecting the first option, even if it isn’t fully supported by their lived experiences.

In order to control for acquiescence, consider tweaking your phrasing to encourage respondents to make a choice truly based on their preferences. Here’s an example:

Q: What would you prefer?

  • A quiet night in
  • A night out with friends

Demand characteristics are cues that could reveal the research agenda to participants, risking a change in their behaviors or views. Ensuring that participants are not aware of the research objectives is the best way to avoid this type of bias.

For example, suppose patients are interviewed about their pain at several points after an operation. On each occasion, patients reported their pain as being less than prior to the operation. While at face value this seems to suggest that the operation does indeed lead to less pain, there is a demand characteristic at play: during the interviews, the researcher would unconsciously frown whenever patients reported more post-op pain. This increased the risk of patients figuring out that the researcher was hoping that the operation would have an advantageous effect.

Social desirability bias is the tendency of participants to give responses that they believe will be viewed favorably by the researcher or other participants. It often affects studies that focus on sensitive topics, such as alcohol consumption or sexual behavior.

For example, suppose you are conducting face-to-face semi-structured interviews with a number of employees from different departments about workplace health initiatives. When asked whether they would be interested in a smoking cessation program, employees expressed widespread enthusiasm for the idea; some of this enthusiasm may simply reflect what they think the interviewer wants to hear.

Note that while social desirability and demand characteristics may sound similar, there is a key difference between them. Social desirability is about conforming to social norms, while demand characteristics revolve around the purpose of the research.

Courtesy bias stems from a reluctance to give negative feedback, so as to be polite to the person asking the question. Small-group interviewing where participants relate in some way to each other (e.g., a student, a teacher, and a dean) is especially prone to this type of bias.

Question order bias

Question order bias occurs when the order in which interview questions are asked influences the way the respondent interprets and evaluates them. This occurs especially when previous questions provide context for subsequent questions.

When answering subsequent questions, respondents may orient their answers to previous questions (called a halo effect ), which can lead to systematic distortion of the responses.

Extreme responding is the tendency of a respondent to answer in the extreme, choosing the lowest or highest response available, even if that is not their true opinion. Extreme responding is common in surveys using Likert scales , and it distorts people’s true attitudes and opinions.

Disposition towards the survey can be a source of extreme responding, as well as cultural components. For example, people coming from collectivist cultures tend to exhibit extreme responses in terms of agreement, while respondents indifferent to the questions asked may exhibit extreme responses in terms of disagreement.

Selection bias is a general term describing situations where bias is introduced into the research from factors affecting the study population.

Common types of selection bias are:

  • Sampling or ascertainment bias
  • Attrition bias
  • Self-selection (or volunteer) bias
  • Survivorship bias
  • Nonresponse bias
  • Undercoverage bias

Sampling bias occurs when your sample (the individuals, groups, or data you obtain for your research) is selected in a way that is not representative of the population you are analyzing. Sampling bias threatens the external validity of your findings and influences the generalizability of your results.

The easiest way to prevent sampling bias is to use a probability sampling method . This way, each member of the population you are studying has an equal chance of being included in your sample.
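As a minimal sketch (using a made-up list of resident identifiers as the sampling frame), simple random sampling, the most basic probability sampling method, gives every member of the frame the same chance of being drawn:

```python
import random

random.seed(42)

# Hypothetical sampling frame: identifiers for everyone in the target population.
population = [f"resident_{i:04d}" for i in range(5000)]

# Simple random sampling: every member has the same probability of selection,
# unlike a convenience sample of whoever happens to be easiest to reach.
sample = random.sample(population, k=200)

print(len(sample), sample[:3])
```

A convenience sample, by contrast, draws only from whoever is easiest to reach, which is exactly what breaks the equal-probability property and introduces sampling bias.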

Sampling bias is often referred to as ascertainment bias in the medical field.

Attrition bias occurs when participants who drop out of a study systematically differ from those who remain in the study. Attrition bias is especially problematic in randomized controlled trials for medical research because participants who do not like the experience or have unwanted side effects can drop out and affect your results.

You can minimize attrition bias by offering incentives for participants to complete the study (e.g., a gift card if they successfully attend every session). It’s also a good practice to recruit more participants than you need, or minimize the number of follow-up sessions or questions.

For example, suppose you are evaluating a two-month program. You provide a treatment group with weekly one-hour sessions, while a control group attends sessions on an unrelated topic. You complete five waves of data collection to compare outcomes: a pretest survey, three surveys during the program, and a posttest survey. If participants who find the sessions unhelpful stop attending before the posttest, those who remain will systematically differ from those who left, biasing the outcome comparison.

Self-selection or volunteer bias

Self-selection bias (also called volunteer bias ) occurs when individuals who volunteer for a study have particular characteristics that matter for the purposes of the study.

Volunteer bias leads to biased data, as the respondents who choose to participate will not represent your entire target population. You can avoid this type of bias by using random assignment —i.e., placing participants in a control group or a treatment group after they have volunteered to participate in the study.

Closely related to volunteer bias is nonresponse bias , which occurs when a research subject declines to participate in a particular study or drops out before the study’s completion.

For example, suppose you recruit volunteers for a health study at a hospital. Considering that the hospital is located in an affluent part of the city, volunteers are more likely to have a higher socioeconomic standing, higher education, and better nutrition than the general population.

Survivorship bias occurs when you do not evaluate your data set in its entirety: for example, by only analyzing the patients who survived a clinical trial.

This strongly increases the likelihood that you draw (incorrect) conclusions based upon those who have passed some sort of selection process—focusing on “survivors” and forgetting those who went through a similar process and did not survive.

Note that “survival” does not always mean that participants died! Rather, it signifies that participants did not successfully complete the intervention.

A classic example is pointing to famous entrepreneurs who dropped out of college as evidence that dropping out leads to success. However, most college dropouts do not become billionaires. In fact, there are many more aspiring entrepreneurs who dropped out of college to start companies and failed than succeeded.

Nonresponse bias occurs when those who do not respond to a survey or research project are different from those who do in ways that are critical to the goals of the research. This is very common in survey research, when participants are unable or unwilling to participate due to factors like lack of the necessary skills, lack of time, or guilt or shame related to the topic.

You can mitigate nonresponse bias by offering the survey in different formats (e.g., an online survey, but also a paper version sent via post), ensuring confidentiality, and sending reminders to complete the survey.

For example, suppose you survey the residents of a neighborhood and find that responses from working adults are scarce. You notice that your surveys were conducted during business hours, when working-age residents were less likely to be home.

Undercoverage bias occurs when you only sample from a subset of the population you are interested in. Online surveys can be particularly susceptible to undercoverage bias. Despite being more cost-effective than other methods, they can introduce undercoverage bias as a result of excluding people who do not use the internet.

Cognitive bias refers to a set of predictable (i.e., nonrandom) errors in thinking that arise from our limited ability to process information objectively. Rather, our judgment is influenced by our values, memories, and other personal traits. These create “ mental shortcuts” that help us process information intuitively and decide faster. However, cognitive bias can also cause us to misunderstand or misinterpret situations, information, or other people.

Because of cognitive bias, people often perceive events to be more predictable after they happen.

Although there is no general agreement on how many types of cognitive bias exist, some common types are:

  • Anchoring bias  
  • Framing effect  
  • Actor-observer bias
  • Availability heuristic (or availability bias)
  • Confirmation bias  
  • Halo effect
  • The Baader-Meinhof phenomenon  

Anchoring bias

Anchoring bias is people’s tendency to fixate on the first piece of information they receive, especially when it concerns numbers. This piece of information becomes a reference point or anchor. Because of that, people base all subsequent decisions on this anchor. For example, initial offers have a stronger influence on the outcome of negotiations than subsequent ones.

Framing effect

Framing effect refers to our tendency to decide based on how the information about the decision is presented to us. In other words, our response depends on whether the option is presented in a negative or positive light, e.g., gain or loss, reward or punishment, etc. This means that the same information can be more or less attractive depending on the wording or what features are highlighted.

Actor–observer bias

Actor–observer bias occurs when you attribute the behavior of others to internal factors, like skill or personality, but attribute your own behavior to external or situational factors.

In other words, when you are the actor in a situation, you are more likely to link events to external factors, such as your surroundings or environment. However, when you are observing the behavior of others, you are more likely to associate behavior with their personality, nature, or temperament.

One interviewee recalls a morning when it was raining heavily. They were rushing to drop off their kids at school in order to get to work on time. As they were driving down the highway, another car cut them off as they were trying to merge. They tell you how frustrated they felt and exclaim that the other driver must have been a very rude person.

At another point, the same interviewee recalls that they did something similar: accidentally cutting off another driver while trying to take the correct exit. However, this time, the interviewee claimed that they always drive very carefully, blaming their mistake on poor visibility due to the rain.

Availability heuristic

Availability heuristic (or availability bias) describes the tendency to evaluate a topic using the information we can quickly recall to our mind, i.e., that is available to us. However, this is not necessarily the best information, rather it’s the most vivid or recent. Even so, due to this mental shortcut, we tend to think that what we can recall must be right and ignore any other information.

Confirmation bias

Confirmation bias is the tendency to seek out information in a way that supports our existing beliefs while also rejecting any information that contradicts those beliefs. Confirmation bias is often unintentional but still results in skewed results and poor decision-making.

Let’s say you grew up with a parent in the military. Chances are that you have a lot of complex emotions around overseas deployments. This can lead you to over-emphasize findings that “prove” that your lived experience is the case for most families, neglecting other explanations and experiences.

Halo effect

The halo effect refers to situations whereby our general impression about a person, a brand, or a product is shaped by a single trait. It happens, for instance, when we automatically make positive assumptions about people based on something positive we notice, while in reality, we know little about them.

The Baader-Meinhof phenomenon

The Baader-Meinhof phenomenon (or frequency illusion) occurs when something that you recently learned seems to appear “everywhere” soon after it was first brought to your attention. However, this is not the case. What has increased is your awareness of something, such as a new word or an old song you never knew existed, not their frequency.

How to avoid bias in research

While very difficult to eliminate entirely, research bias can be mitigated through proper study design and implementation. Here are some tips to keep in mind as you get started.

  • Clearly explain in your methodology section how your research design will help you meet the research objectives and why this is the most appropriate research design.
  • In quantitative studies, make sure that you use probability sampling to select the participants. If you’re running an experiment, make sure you use random assignment to assign your control and treatment groups.
  • Account for participants who withdraw or are lost to follow-up during the study. If they are withdrawing for a particular reason, it could bias your results. This applies especially to longer-term or longitudinal studies.
  • Use triangulation to enhance the validity and credibility of your findings.
  • Phrase your survey or interview questions in a neutral, non-judgmental tone. Be very careful that your questions do not steer your participants in any particular direction.
  • Consider using a reflexive journal. Here, you can log the details of each interview, paying special attention to any influence you may have had on participants. You can include these in your final analysis.
Other types of research bias

  • Baader–Meinhof phenomenon
  • Sampling bias
  • Ascertainment bias
  • Self-selection bias
  • Hawthorne effect
  • Omitted variable bias
  • Pygmalion effect
  • Placebo effect

Frequently asked questions about research bias

Research bias affects the validity and reliability of your research findings, leading to false conclusions and a misinterpretation of the truth. This can have serious implications in areas like medical research where, for example, a new form of treatment may be evaluated.

Observer bias occurs when the researcher’s assumptions, views, or preconceptions influence what they see and record in a study, while actor–observer bias refers to situations where respondents attribute internal factors (e.g., bad character) to justify other’s behavior and external factors (difficult circumstances) to justify the same behavior in themselves.

Response bias is a general term used to describe a number of different conditions or factors that cue respondents to provide inaccurate or false answers during surveys or interviews. These factors range from the interviewer’s perceived social position or appearance to the phrasing of questions in surveys.

Nonresponse bias occurs when the people who complete a survey are different from those who did not, in ways that are relevant to the research topic. Nonresponse can happen because people are either not willing or not able to participate.



Eliminating Explicit and Implicit Biases in Health Care: Evidence and Research Needs

Monica B. Vela

1 Department of Medicine, Section of Academic Internal Medicine, University of Illinois College of Medicine in Chicago, Chicago, Illinois, USA

Amarachi I. Erondu

2 Department of Internal Medicine and Pediatrics, University of California, Los Angeles Medical Center, Los Angeles, California, USA

Nichole A. Smith

3 Department of Internal Medicine, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA

Monica E. Peek

4 Department of Medicine, Section of General Internal Medicine and Chicago Center for Diabetes Translation Research, University of Chicago, Chicago, Illinois, USA

James N. Woodruff

5 Pritzker School of Medicine, University of Chicago, Chicago, Illinois, USA

Marshall H. Chin

6 Department of Medicine and Chicago Center for Diabetes Translation Research, University of Chicago, Chicago, Illinois, USA


Health care providers hold negative explicit and implicit biases against marginalized groups of people such as racial and ethnic minoritized populations. These biases permeate the health care system and affect patients via patient–clinician communication, clinical decision making, and institutionalized practices. Addressing bias remains a fundamental professional responsibility of those accountable for the health and wellness of our populations. Current interventions include instruction on the existence and harmful role of bias in perpetuating health disparities, as well as skills training for the management of bias. These interventions can raise awareness of provider bias and engage health care providers in establishing egalitarian goals for care delivery, but these changes are not sustained, and the interventions have not demonstrated change in behavior in the clinical or learning environment. Unfortunately, the efficacy of these interventions may be hampered by health care providers’ work and learning environments, which are rife with discriminatory practices that sustain the very biases US health care professions are seeking to diminish. We offer a conceptual model demonstrating that provider-level implicit bias interventions should be accompanied by interventions that systemically change structures inside and outside the health care system if the country is to succeed in influencing biases and reducing health inequities.

1. INTRODUCTION

Although expressions of explicit bias have declined in the United States over time, implicit bias has remained unrelenting. Health care providers hold negative explicit and implicit biases against many marginalized groups of people, including racial and ethnic minoritized populations, disabled populations, and gender and sexual minorities, among others ( 29 , 63 ). Implicit bias permeates the health care system and affects patients via patient–clinician communication, clinical decision making, and institutionalized practices ( 78 ). Higher education systems, including medical schools and academic hospitals, have been affected by the discrimination and bias that have long permeated the health care delivery system ( 84 , 104 ). Bias in admissions and promotions processes, in classroom and bedside instruction, and by health care providers contributes to the constant messaging that stereotypes and isolates marginalized groups ( 80 , 102 , 105 ). These biases hinder improvement in compositional diversity of health care providers, long recognized as an important mechanism in reducing health care disparities ( 60 ). This complex system of discrimination and biases causes devastating health inequities that persist despite a growing understanding of the root causes and the health care system’s professional, ethical, and moral responsibility to address these inequities.

It has been theorized that implicit bias and structural racism mutually reinforce one another—ambient structural racism and its outcomes reinforce an individual’s psychological associations between racial identity and poorer outcomes (implicit bias) ( 20 , 21 ). Inequitable structural determinants have diminished housing, education, health care, and income and have increased exposure to environmental pollutants and chronic stressors for marginalized populations ( 76 , 108 ). Structural inequities and discrimination have created stereotypes of marginalized populations or communities and implicit and explicit biases toward them. Health care providers hold negative explicit and implicit biases against racialized minorities. A similar reinforcing dynamic may exist for marginalized populations such as those who are overweight/obese, use wheelchairs, have limited English proficiency, have mental health illness, and belong to lower socioeconomic classes ( 29 ). These biases can facilitate the creation and perpetuation of discriminatory systems and practices, creating a complex feedback loop that sustains itself.

Addressing bias remains a fundamental professional responsibility of health care and public health professionals accountable for population health and wellness ( 64 , 65 ). This article ( a ) provides an overview of existing evidence of bias among health professionals, health practitioners, and public health workers in the practice and training environments (and lay health workers as appropriate) and its impact on health disparities; ( b ) systematically reviews the extant literature for evidence and limitations of current interventions designed to reduce or manage biases; ( c ) explores the interaction between bias and structural elements of the health care system (including medical education); and ( d ) proposes a conceptual model that frames bias not as an independent factor in the generation of disparities but as one element of a reinforcing system of elements that perpetuates such disparities. Ultimately, we provide evidence that interventions designed to reduce or manage existing explicit and implicit biases in clinical settings and public health are insufficient and will continue to fall short in reducing health inequities if we do not concomitantly address the racism and discrimination ingrained in health, medical educational systems, and other societal structures.

2. BACKGROUND

2.1. Overview of Bias

Critical to an understanding of interventions that address explicit and implicit biases in health care is an understanding of key terminology, tools used to measure bias, and the evidence for and impact of these biases in health care.

2.1.1. Key terminology: What are implicit and explicit biases?

Implicit biases are unconscious mental processes that lead to associations and reactions that are automatic and without intention; actors have no awareness of the associations with a stimulus ( 41 , 43 ) ( Table 1 ). Axt et al. ( 4 ) maintain that social status is relational and people unconsciously hold more negative attitudes or feelings about membership of an outgroup (people with whom they do not share identities) than about membership of an ingroup (people with whom they share identities). A stereotype is a fixed set of attributes associated with a social group ( 49 ).

Terminology of bias

Term | Definition
Discrimination | Discrimination is “the result of either implicit or explicit biases and is the inequitable treatment and/or impact of general policies, practices, and norms on individuals and communities based on social group membership” ( , p. S5).
Ethnicity | Ethnicity is “a social system defining a group that shares a common ancestry, history or culture with some combination of shared geographic origins, family patterns, language, or cultural norms, religious traditions, or other cultural and social characteristics” ( , p. 325).
Explicit bias | Explicit forms of bias include “preferences, beliefs, and attitudes of which people are generally consciously aware, endorsed, and can be identified and communicated” ( , p. 1).
Hidden curriculum | “Lessons taught through socialization of learners especially as it pertains to professionalism, humanism, and accountability, as opposed to explicitly taught in the classroom or bedside” ( , p. 50).
Implicit bias | Implicit biases are “unconscious mental processes that lead to associations and reactions that are automatic and without intention and actors have no awareness of the associations with a stimulus. Implicit bias goes beyond stereotyping to include favorable or unfavorable evaluations toward groups of people.” While we are not aware these implicit biases exist, they have a significant impact on decision making ( , p. 14).
Institutional racism | Institutional racism (structural) “refers to the processes of racism that are embedded in laws (local, state and federal), policies, and practices of society and its institutions that provide advantages to racial groups deemed superior while differentially oppressing, disadvantaging or otherwise neglecting racial groups viewed as inferior” ( , p. 107).
Race | “Race is primarily a social category, based on nationality, ethnicity, phenotypic or other markers of social difference, which captures differential access to power and resources in society. It functions on many levels and socializes people to accept as true the inferiority of nondominant racial groups leading to negative normative beliefs (stereotypes) and attitudes (prejudice) toward stigmatized racial groups which undergird differential treatment of members of these groups by both individuals and social institutions” ( , p. 106).
Racism | “Racism is an organized social system in which the dominant racial group, based on an ideology of inferiority, categorizes and ranks people into social groups called ‘races’ and uses its power to devalue, disempower, and differentially allocate valued society resources and opportunities to groups defined as inferior... A characteristic of racism is that its structure and ideology can persist in governmental and institutional policies in the absence of individual actors who are explicitly racially prejudiced” ( , p. 106).
Role modeling | Role modeling is a mechanism for teaching behavior through learning by observation ( , p. 26).
Stereotype | A stereotype is “a fixed set of attributes associated with a social group” ( , p. 209).
Stereotype threat | Stereotype threat “occurs when cues in the environment make negative stereotypes associated with an individual’s group status salient, triggering physiological and psychological processes that have detrimental consequences for behavior” and performance of the individual who identifies as a member of the stereotyped group ( , p. S169).

Implicit bias goes beyond stereotyping to include favorable or unfavorable evaluations toward groups of people ( Table 1 ). Although we are not aware these implicit biases exist, they have a significant impact on decision making ( 97 ).

A belief is explicit if consciously endorsed ( 43 ). Explicit forms of bias include preferences, beliefs, and attitudes of which people are generally consciously aware, personally endorse, and can identify and communicate ( 22 ). Discrimination, the result of either implicit or explicit biases, is the inequitable treatment and/or impact of general policies, practices, and norms on individuals and communities based on social group membership ( 65 , 76 ). Daumeyer et al. ( 22 ) argue that implicit biases must be exposed and discussed so that people and institutions can be held accountable for their effects. They argue for nuanced conversations about the ways in which implicit biases shape behavior and the ways to combat it.

2.1.2. Tools used to measure implicit bias: How good are these measures? Have they been used outside of medicine?

In 1998, Greenwald et al. ( 45 ) described a word association test that identified implicit stereotype effects through indirect reaction time measures even when subjects self-reported low measures of prejudice. Since then, the implicit association test (IAT) has consistently demonstrated implicit stereotyping for a range of different social categories, particularly gender and ethnicity ( Table 1 ). Greenwald et al. ( 42 ) maintain that statistically small effects of the IAT can have socially large effects. A meta-analysis by Greenwald et al. ( 45 ) demonstrated the predictive validity of the IAT regarding implicit stereotype associations to behavioral outcomes across a range of social subject areas. Some critics challenge whether the IAT measures implicit bias and predicts behavior, and question its utility in clinical and other real-world situations ( 3 , 69 ). Most researchers agree that the IAT has limitations ( 44 ). It does not have high test-retest reliability in the same individual, and it is not useful as a tool to label individuals as implicitly sexist or racist or to predict behavior ( 73 ). The IAT has been used in health professions education as a metric to demonstrate the efficacy of educational interventions meant to reduce implicit bias and as a tool to raise awareness of existing implicit bias among health care trainees and providers ( 101 ).

2.1.3. Implicit biases in health care: What is the evidence for racial bias among health care professionals? What is the impact of such bias in health care?

Implicit racial and ethnic bias exists among health care professionals in favor of White patients and against Black, Hispanic, and dark-skinned patients even when all other major factors (e.g., socioeconomic differences, insurance status) have been controlled and accounted for. Hall et al. ( 47 ) published a systematic literature review of 15 studies designed to explore the evidence of provider implicit racial bias and health outcomes. In the studies measuring prevalence, rates of anti-Black bias in health care providers ranged from 42% to 100%. These findings were redemonstrated in similar reviews conducted in 2017 ( 29 ) and 2018 ( 63 ).

Hoffman et al. ( 50 ) demonstrated in 2016 that White medical students and residents were more likely to believe that Black patients had thicker skin and smaller brains, were more likely to rate Black patients as feeling less pain than White patients, and were more likely to view Black patients as not needing the same levels of pain medication as White patients. Several studies have demonstrated that the negative implicit biases held by those in the health professions are similar to those seen in the lay population ( 29 ).

The Medical Student Cognitive Habits and Growth Evaluation Study (CHANGES) has provided the greatest insight into the implicit and explicit biases held by medical students and trainees in the United States. This longitudinal multimeasure study followed a large sample of students attending a stratified random sample of 49 US allopathic medical schools and measured associations between possible interventions and levels of biases held by students. A web-based survey completed by more than 4,500 first-year medical students demonstrated that most students exhibited implicit (74%) and explicit (67%) weight bias. The study also demonstrated that scores of implicit weight bias were similar to scores of implicit bias against racial minorities (74%) in the same group of students ( 86 ). The size and scope of this study demonstrate undeniable evidence that implicit bias is pervasive among medical students, even in the first year of medical school. The multiple papers and findings generated by this foundational study were excluded from the final selection of studies in the results section because the study was observational and did not introduce interventions.

Biases affect health care delivery and public health outcomes, the health professions workplace and learning environments, and the diversity of trainees and workforce ( Table 2 ). Hall et al. ( 47 ) demonstrated that these implicit biases have negatively affected patient–provider interactions, treatment decisions, and patient adherence to treatment. The most consistent evidence is found in studies of patient–provider interactions in which the bias of health care providers has been repeatedly linked to discriminatory care ( 18 )—patients rate physicians with higher levels of implicit bias as less patient-centered in the primary care setting. Blanchard & Lurie ( 6 ) demonstrated that patients who perceived that they would have received better treatment if they were of a different race were significantly less likely to receive optimal chronic disease screening and more likely to not follow the doctor’s advice or to delay care. In a large study of adult primary care, higher implicit bias among health care providers was associated with patients’ lower ratings of interpersonal treatment, contextual knowledge, communication, and trust ( 5 ).

Table 2. Impacts of implicit bias

Health care delivery: patient-provider communication; patient-provider relationships; patient satisfaction; patient perception of physician’s patient-centeredness; patient treatment adherence; provider decision making; provider’s perspective of patient’s likelihood to adhere to treatment.

Public health: resource allocation (testing locations, vaccine distribution, location of environmental stressors).

Health professions workplace and learning environments: promotions practices; compensation; evaluations; awards and recognition; research grants; stress and isolation.

Diversity of trainees and workforce: recruitment and selection of future trainees; inclusive learning environment.

Other studies have confirmed associations between provider bias (demonstrated via IAT testing) and disparate treatment of their patients ( 63 ). In a systematic literature review, six studies found that higher implicit bias among health care providers was associated with disparities in treatment recommendations, expectations of therapeutic bonds, pain management, and empathy ( 63 ). Seven studies that examined the impact of implicit provider bias on real-world patient–provider interaction found that health care providers with stronger implicit bias demonstrated poorer patient–provider communication and that health care providers with high implicit biases ( a ) provided lower rates of postoperative narcotic prescriptions for Black children than for White children ( 93 ), ( b ) had poorer bonding with Black patients than with White patients ( 55 ), and ( c ) made disparate recommendations for thrombolytic therapy for Black patients and White patients ( 40 ).

A study of 3,756 students at 49 US medical schools demonstrated that high scores of racism as measured by the three variables were significantly correlated with low scores of student intentions to work in underserved areas and to provide care to minority populations ( 74 ).

Implicit bias affects not only patients but also trainees and faculty within health care systems. A 2014 systematic literature review revealed that rates of harassment and discrimination against trainees (24% reported racial discrimination, 33% reported sexual harassment, and 54% reported gender discrimination) have remained unchanged over time ( 31 ). Minority trainees report facing daily bias and microaggressions and having feelings of isolation and substantial stress ( 74 ). Minority medical students reported five-times-higher odds of racial discrimination and isolation than did nonminority peers ( 26 ). Stereotype threat (defined in Table 1 ) is common, particularly among non-White students, interferes with learning, and adds to the cognitive load of minoritized students ( 9 ). Thus, bias in health professions training can affect the performance of racialized minorities. Early and small differences in assessed clinical performance, which may be affected by implicit biases, lead to larger differences in grades and selection for awards [e.g., Alpha Omega Alpha Honor Medical Society (AOA)], ultimately affecting career trajectories of racial minority candidates ( 102 ). For example, significant differences in negative descriptive words on medical students’ evaluations have been found across different racial and gender groups ( 91 ). Membership in AOA, conferred to only 16% of each graduating medical school class, has effectively barred diversity in many specialties and may represent a longstanding form of structural racism ( 7 ).

2.2. Impact of Interventions Designed to Reduce or Manage Bias

Literature outside of health care has introduced techniques to manage implicit bias, including stereotype replacement (replacing stereotypical responses to bias with nonstereotypical ones), counter-stereotypic imaging (imagining known counter-stereotypical people), individuation (learning personal attributes of persons present rather than identifying group attributes), perspective taking (taking the perspective of persons present), and increasing opportunities for contact. Several studies have explored the efficacy of these interventions. Strikingly, the only study demonstrating reduction of measured implicit bias was conducted on undergraduate students enrolled in a course using a prejudice-habit-breaking intervention involving instruction of all the aforementioned techniques with effects lasting 8 weeks ( 24 ). Unfortunately, these results may not be generalizable and have not been reproduced. Lai et al. ( 57 ) tested nine interventions and although all immediately reduced implicit preferences, results were sustained for only several hours to days. FitzGerald et al. ( 30 ) conducted in 2019 a systematic review of bias interventions utilizing the IAT or other measures across multiple disciplines. They found that most studies did not provide robust data to support many interventions, although perspective taking was more successful than counter-stereotypic imaging.

2.3. Interactions Between Bias and Structural Elements of the Health Care System

Implicit bias has important interactions with structural elements of the health care system. Evidence suggests that implicit bias can reinforce structural dimensions of the health care system that generate disparities. Other evidence suggests that structural dimensions of the health care system and medical education can reinforce implicit bias. These interactions suggest a complex and mutually reinforcing relationship between implicit bias and structural elements of the health care system.

2.3.1. The relationship between implicit bias and public policy.

Implicit biases influence the decisions of policy makers in government and health care that result in structural racism ( 70 , 75 , 81 ). Public health responses to the coronavirus disease 2019 (COVID-19) pandemic offer evidence of this dynamic. Despite data demonstrating that non-Hispanic Black and Hispanic populations were dying at younger average ages (71.8 years and 67.3 years, respectively) than the non-Hispanic White population (80.9 years), the phase 1b vaccination strategy targeted individuals aged 75 and older ( 25 ). Thus, federal public health recommendations ignored or discounted the evidence that an age-based approach would lead to further disparities in COVID-19 infections and mortality, amounting to structural racism against Black and Hispanic populations.

2.3.2. The relationship between implicit bias and cognitive workload: overcrowding and patient load.

Studies have consistently shown that decision makers burdened with higher cognitive load are more likely to make biased decisions ( 10 ). A more recent study of physicians in the emergency department has confirmed that cognitive stressors such as patient overcrowding and patient load were associated with increased implicit racial bias as measured by a race IAT preshift compared to postshift ( 53 ).

2.3.3. The relationship between implicit bias and the learning/training environment.

Unfortunately, to date, medical education and educators have not adequately addressed the implicit biases that place marginalized patients at high risk of receiving disparate care and suffering poorer health outcomes. In fact, Phelan et al. ( 84 ) concluded that structural racism is at play in medical education through many medical schools’ formal and hidden curricula ( 52 , 88 ). The formal curriculum can be measured by the number of hours students spend in training related to racial disparities and bias, structured service-learning, minority health activities, and cultural awareness programming, and by the completion of an IAT. The hidden curriculum, by contrast, is unofficial and often more powerful, consisting of faculty role modeling ( 52 ), institutional priorities around the interracial climate, and experiences of microaggressions.

Most medical students continue to believe that both race and gender (as opposed to sex) are genetic and biological constructs. Even when students are taught otherwise, the practice of race-based medicine reinforces these characterizations. When students are taught about health disparities without the appropriate contextualization of structural racism, historic segregation, the pathologization of gender and sexual orientation, and the medical professions’ complicity in scientific racism, students may assume there is something inherently wrong with racialized minorities rather than with the systems that have harmed them. Students are often taught that race, instead of racism, is an independent risk factor for disease. They learn to associate race with any number of diseases. They are taught to incorporate the race of their patient into the opening line of clinical presentations even though there is no evidence that race is relevant to the establishment of diagnoses. They learn to use race-based algorithms to calculate glomerular filtration rates, pulmonary function testing, hypertension guidelines, and even urinary tract infection diagnoses in pediatric populations ( 2 ). Such messaging only serves to undo any structured teaching on the social construct of race and gender ( 16 ).

2.3.4. The relationship between implicit bias and health care outcomes.

As discussed above, there is substantial evidence that implicit bias results in health care disparities through mechanisms including disparate care and trust. But the relationship between implicit bias and outcomes may be bidirectional. Evidence has shown that implicit attitudes are malleable and that such attitudes are learned and strengthened through repeated observation of particular classes of people in valued or devalued circumstances. For example, individuals exposed to less favorable exemplars from a given identity demonstrate increased implicit bias and stereotypes with respect to that entire group ( 20 ). Furthermore, these investigators showed that changing exposure to more favorable exemplars can diminish established implicit bias. This phenomenon has been demonstrated in experiments looking specifically at race- and age-related attitudes ( 21 ). These findings suggest that a practitioner’s implicit bias toward a marginalized group may be augmented or diminished by the clinical outcomes of that group.

2.3.5. Favorable relationships between structural elements of training and bias: curricula, climate, and contact.

The CHANGES study demonstrated that students’ implicit bias against sexual minorities was reduced at 42 medical schools and increased at only 7 schools. Reduced bias was associated with more frequent interaction with LGBT students, faculty, and patients; the perceived quality of that contact; and increased training involving skills in caring for sexual minorities ( 85 ).

The CHANGES study found that changes in student implicit racial attitudes were independently associated with formal curricula related to disparities in health and health care, cultural competence, and minority health; informal curricula (or hidden curricula, defined in Table 1 ), including racial climate and role model behavior; and the amount and favorability of interracial contact during medical school ( 84 ).

Thus, carefully designed structural elements of the learning environment can favorably affect the implicit biases and wellness of students.

2.4. Systematic Review of Studies with Interventions

A systematic literature review was performed with the goal of assessing the efficacy of extant interventions designed to reduce the explicit and implicit biases of health care providers and of learners across the continuum of health professions education.

2.4.1. Methods.

We searched three databases (ERIC, PubMed, and MedEdPORTAL) using key terms ( Figure 1 ). The terms “implicit bias,” “prejudice,” and “stigma” were often used interchangeably, and the terms “bias” and “biases” yielded more than 100,000 articles, often with little relevance to implicit bias in the health professions. We found, as did FitzGerald et al. ( 30 ) in their systematic review, that indexing in databases for these terms was inconsistent and that titles and abstracts were often imprecise. We conducted repeated searches with and without these terms, comparing the number of search results. We developed a set of terms most frequently encountered in the titles and abstracts of irrelevant articles and defined important terminology ( Table 1 ) to narrow the search. We reviewed the references of landmark articles and used the advanced search function to increase the likelihood that no key articles were missed.

Figure 1. PRISMA flow diagram of the systematic review.

A study had to include health care professionals, assess an intervention (e.g., training, workshop, didactics, contact, program) designed to address explicit or implicit bias held by health care providers, be written in English, and be published between May 2011 and May 2021. We excluded commentaries, theoretical frameworks, editorials, and institutional or societal pledges that address racism, although these were reviewed for context. We did not exclude qualitative studies, studies without comparison groups, or studies outside North America. However, although we did find studies from other countries detailing explicit and implicit biases, we did not find articles with interventions addressing these biases for inclusion in this review. We extracted subjects, intervention format (e.g., lectures, workshops, discussions, panels, interviews), target (e.g., knowledge, skills, attitudes, IAT), and summary of key findings.

We excluded abstracts that did not include original research or bias reduction as an expected outcome; that did not employ a discrete intervention or, like the CHANGES study, retrospectively identified effective interventions; or that studied populations other than health professions students, trainees, or providers. We excluded articles that focused on self-stigma (e.g., from a diagnosis of obesity, HIV, sexually transmitted infection, mental health) and community-based interventions, as they were not focused specifically on the bias of health professionals. Observational studies without discrete interventions were excluded but were reviewed in Section 1 .

Title, abstract, and full-text review were conducted by three authors (M.B.V., A.I.E., and N.A.S.) and coded to consensus.

2.4.2. Findings.

Twenty-five studies met inclusion criteria ( Table 3 ). None of the studies mentioned in Sections 1 and 2 met inclusion criteria but were reviewed because of their significant contributions to the understanding of the interactions of implicit bias in learning and clinical settings. Most studies (68%) engaged medical students and utilized classroom or web-based interventions. Most studies did not have a control group (72%) and none used actual clinical settings. Three studies focused on interventions for implicit bias of faculty serving on admissions or search committees.

Table 3. Provider-level implicit bias interventions

Interventions without formal measurement of implicit bias/attitudes

Study population: Medical students (n = 25). Intervention: Study and control groups; study group participated in 5-h dialogues on race and bias. Evaluation/outcomes: Pre- and postsurveys; paired t-tests demonstrated increased knowledge and awareness of racial bias and increased comfort talking about race. Limitations: No formal bias measure; self-selected study group of students.

Study population: Faculty who serve on search committees (n = 22). Intervention: 2-h reflection-based workshop on unconscious bias. Evaluation/outcomes: Post-intervention survey evaluated the effectiveness and utility of the exercise; most surveyed found the workshop helpful in preparing for faculty searches. Limitations: Extremely limited evaluation (no pre-/postcomparison); no formal bias measure.

Study population: Medical students (n = 615). Intervention: 2-day orientation on power, privilege, and bias. Evaluation/outcomes: Post-intervention survey; surveys demonstrated raised bias awareness. Limitations: No formal bias measure; no pre-/postcomparison.

Study population: Medical students (n = 187). Intervention: Five 2-h workshops with lectures on bias. Evaluation/outcomes: Pre- and postsurveys; paired t-tests on surveys demonstrated raised awareness of own biases and intent to address bias. Limitations: No formal bias measure.

Study population: Health professions educators (n = 70). Intervention: Introduced a new longitudinal case conference curriculum called HER to discuss and address the impact of structural racism and implicit bias on patient care; utilized case-based discussion, evidence-based exercises, and two conceptual frameworks. Evaluation/outcomes: Tracked conference attendance and postconference surveys; most survey respondents (88% or more) indicated that HER promoted personal reflection on implicit bias, and 75% or more indicated that HER would affect their clinical practice. Limitations: No pre-/postcomparison; no formal bias measure; no control group.

Study population: Faculty (n = 66). Intervention: 90-min interactive workshop that included a reflective exercise, role-play, a brief didactic session, and case-based discussion on the use of language in patient charts. Evaluation/outcomes: Post-intervention survey with four Likert scale questions; participants felt the workshop met its objectives (4.8 out of 5.0) and strongly agreed that they would apply the skills learned (4.8). Limitations: Self-selected study group; no measure of bias; no control group.

Study population: Family medicine residents (n = 31). Intervention: Training on institutional racism, colonization, and cultural power, followed by humanism and instruction on taking health equity time-outs during clinical time. Evaluation/outcomes: Focus groups conducted 6 months post-intervention identified four themes. Limitations: No measure of bias; no pre-/postcomparison; qualitative analysis only; no control group.

Study population: Medical students (n = 26). Intervention: Service-learning plus reflection. Evaluation/outcomes: Reflection practice questionnaire analysis; students reported recognizing and mitigating bias. Limitations: No formal measure of bias used; no control group.

Study population: Medical students (n = 127). Intervention: Readings/reflections on weight stigma; standardized patient before and after. Evaluation/outcomes: Pre-/post-intervention questionnaires; reduced stereotyping, increased empathy, and improved counseling confidence. Limitations: Weak analysis may be biased itself; no formal bias measurement; no control group.

Interventions with formal measurement of implicit bias/attitudes

Study population: Medical students/elective (n = 218). Intervention: Single session in which students completed an IAT followed by discussion. Evaluation/outcomes: Post-IAT survey; implicit bias deniers were significantly more likely to report IAT results with implicit preferences toward self, to believe the IAT is invalid, and to believe that doctors and the health care system provide equal care to all, and were less likely to report having directly observed inequitable care. Limitations: Self-selected study group; no control group.

Study population: Medical students (n = 180). Intervention: Single IAT administration followed by guided reflective discussion and essay writing. Evaluation/outcomes: Evaluation of reflective essays; students noted raised awareness of bias but were not able to strategize solutions to mitigate bias. Limitations: Prompt did not ask for strategies; no control group.

Study population: Medical students (n = 15). Intervention: Nine 1.5-h sessions focused on promoting skills to empower students to recognize implicit bias reduction as part of professionalism, with three objectives grounded in implicit bias recognition and transformative learning theory. Evaluation/outcomes: Post-intervention focus groups and analysis of semistructured interviews identified major themes. Limitations: Self-selected small group of students; no control group.

Study population: Medical students (n = 72). Intervention: IAT administration followed by small-group debrief and discussion on bias. Evaluation/outcomes: Qualitative analysis of discussion transcripts; students who reach for normative versus personal standards had higher implicit bias post-intervention. Limitations: No post-IAT measure of bias; no control group.

Study population: Nursing students (n = 75). Intervention: Pre/post IAT with debriefing, writing, and teaching of bias management techniques (e.g., internal feedback, humanism). Evaluation/outcomes: Postclass survey conducted 5 weeks after the intervention; learners were extremely likely or likely to (a) take additional IATs and reflect on the results and (b) learn more about unconscious bias. Limitations: No formal analysis of pre/post IATs (focus was on acceptance of bias and its management); no control group.

Study population: Medical students (n = 78). Intervention: Workshops that involved IAT administration, instruction on implicit bias and its impact on decision making, and presentation of six strategies to reduce implicit bias. Evaluation/outcomes: Reduction of implicit bias against Hispanics, as measured by an IAT, in majority students only; no change for minority students was demonstrated. Limitations: No control group; nonclinical setting.

Study population: Medical students, house staff, faculty (n = 468). Intervention: Twenty workshops emphasizing skill building, including lectures, guided reflections, and facilitated discussions. Evaluation/outcomes: Pre- and postsurveys to evaluate the intervention’s capacity to improve awareness of bias and to address it through allyship; survey response rate was 80%; paired t-tests demonstrated the greatest improvements in understanding of the process of allyship, the ability to describe strategies to address, assess, and recognize unconscious bias, and knowledge of managing situations in which prejudice, power, and privilege are involved. Limitations: Improved confidence in addressing bias but no measure of bias reduction.

Study population: Faculty on admissions committee (n = 140). Intervention: Black-White IAT administered before the 2012–2013 medical school admission cycle; study participants received results before the start of the admission cycle and were surveyed on the impact at the end of the cycle in May 2013. Evaluation/outcomes: Most survey respondents (67%) thought the IAT might be helpful in reducing bias, 48% were conscious of their individual results when interviewing candidates in the next cycle, and 21% reported that knowledge of their IAT results impacted their admissions decisions in the subsequent cycle; this class was the most diverse to matriculate in the Ohio State University College of Medicine’s history. Limitations: Unclear whether other factors affected matriculation of students.

Study population: Faculty members (n = 281). Intervention: Standardized, 20-min educational intervention to educate faculty about implicit biases and strategies for overcoming them. Evaluation/outcomes: Pre-/postassessments; the intervention had a small but significant effect on the implicit biases surrounding women and leadership of all participants regardless of age and gender; faculty experienced significant increases in their perceptions of personal bias (Cohen’s d = 0.50 and 0.17; p < 0.01 for both questions), perceptions of societal bias (Cohen’s d = 0.14, 0.12, and 0.25; p < 0.05 for all three questions), and perceptions of bias in academic medicine (Cohen’s d = 0.38, 0.57, and 0.58; p < 0.001 for all three questions). Limitations: Immediate impact only; no control group.

Study population: Medical students (n = 64). Intervention: Study participants watched a video linking obesity to genetics and environment. Evaluation/outcomes: Beliefs About Obese Persons, Attitudes Toward Obese Persons, and Fat Phobia scales administered pre- and post-intervention; paired t-tests revealed decreased negative stereotypes and beliefs. Limitations: No longitudinal results; no control group.

Study population: House staff (n = 69). Intervention: Narrative photography to prompt reflection and photovoice of Latino adolescents. Evaluation/outcomes: Control and intervention groups; measured ethnocultural empathy, health care empathy, patient centeredness, and implicit attitudes using the affect misattribution procedure; all measures improved, with some note of a dose response with more exposure. Limitations: Nonclinical setting.

Study population: Medical students (n = 129). Intervention: Workshop to address obesity-related bias using theater reading of a play (intervention group) versus a lecture on obesity (control group); students randomly assigned to groups. Evaluation/outcomes: Obesity-specific IAT and anti-fat attitudes questionnaire pre-/postworkshop; reduced explicit fat bias in the theater group, with no change in implicit bias or empathy post-intervention or 4 months later. Limitations: Nonclinical setting.

Study population: Primary care providers (n = 185). Intervention: Study participants randomized to intervention (lecture and contact) or control (lecture and discussion). Evaluation/outcomes: Beliefs and Attitudes towards Mental Health Service Users’ Rights Scale; reduced stigmatizing beliefs and attitudes at 1 month in the intervention group but a rebound effect at 3 months. Limitations: No formal measure of bias; nonclinical setting.

Study population: Medical students (n = 111). Intervention: One-time contact-based educational intervention on the stigma of mental illness among medical students, compared with a multimodal undergraduate psychiatry course. Evaluation/outcomes: Opening Minds Scale for Health Care Providers to assess changes in stigma; stigma scores for both groups were significantly reduced upon course completion (p < 0.0001) but were not significantly changed following the one-time contact-based educational intervention in the primary analysis. Limitations: Nonclinical setting.

Study population: Medical students (n = 160). Intervention: Intergroup contact theory (facilitated contact to reduce bias) plus 50 h of competency-based curriculum on inclusive care of LGBTQ and gender-nonconforming individuals through lectures, standardized patients, discussion, panels, and reflective writing. Evaluation/outcomes: Study and control groups; pre and post IATs with debriefings demonstrated reduced implicit preference for straight people; IATs with debriefings were important when used to facilitate the curriculum. Limitations: Nonclinical setting.

Study population: Medical students (n = 50). Intervention: Three cultural competency training sessions led by LGBTQ2S+ experts and elders from the community; study participants randomized to intervention and control groups. Evaluation/outcomes: Focus group discussions conducted; pre-/postassessment using the Lesbian, Gay, and Bisexual Knowledge and Attitudes Scale for Heterosexuals and The Riddle Scale: Attitudes towards Gay, Lesbian, Bisexual, and Trans people; measurable and relevant changes in health care students’ perceived knowledge, attitudes, and clinical behavior regarding LGBTQ2S+ populations. Limitations: Nonclinical setting.

Abbreviations: HER, Health Equity Round; IAT, implicit association test.

3. DURATION OF INTERVENTION EFFECT

The three studies of faculty serving on admissions or search committees reported increased awareness of biases, but none reported bias reduction or long-lasting impact.

Three studies followed subjects 3, 4, and 6 months post-intervention, but only one noted a lasting positive impact ( 96 ).

4. NOVEL INTERVENTION CONTENT

All studies addressing implicit bias among health care providers raised awareness of implicit bias through didactic instruction, discussions, workshops or other reflection-based techniques (e.g., service-learning, photovoice, contact-based interventions, theater reading; see Table 4 ), or an IAT or similar measure.

Table 4. Definitions of intervention types used in selected studies

Allyship training: “An active, consistent, and arduous practice of unlearning and re-evaluating, in which a person of privilege seeks to operate in solidarity with a marginalized group.” “Allyship begins with an awareness of unconscious biases and then moves to actions that address inequities in everyday interactions to create an inclusive culture for example to amplify the voices of those in underrepresented groups and to advocate for equitable practices” (p. 6).
Bias literacy: Promotes a basic understanding of key terms, skills, and concepts related to bias as a first step to organizational change (p. 64; p. 22).
Brave space: “A space where difficult, diverse, and often controversial issues are presented and can be discussed with a common goal of understanding the barriers to equity in health care” (p. 87).
Emotional regulation: “The processes by which we influence which emotions we have, when we have them, and how we experience and express them” (p. 282).
Intergroup contact: The promotion of contact between two groups with the goal of reducing prejudice (p. 66).
Photovoice: “A method that allows participants to use photography to document their experiences and dialogue to eventually influence change” (p. 318).
Service-learning: A “pedagogy of engagement wherein students address a genuine community need by engaging in volunteer service that is connected explicitly to the academic curriculum through structured ongoing reflections” (p. 115).
Theater reading: Play reading with students as active participants (p. 232).

Despite the limitations noted in Section 2 , the IAT continues to be widely utilized. The IAT and other measures ( 32 ) of implicit bias, stigma, and attitudes toward groups of persons were used among subjects to ( a ) demonstrate the existence of participant implicit biases, ( b ) act as a springboard to create cognitive dissonance for oral and/or written reflection and to practice bias management skills, and ( c ) evaluate interventions. Gonzalez et al. ( 37 ) found that using the IAT without priming on its results and without a follow-up debriefing led some subjects (22%) to question the validity of the measure and the existence of implicit biases, and therefore advised judicious use of the IAT and trained facilitators. Subjects who accepted the results of the IAT were not able to develop management strategies for those biases without dedicated instruction.

Despite having low explicit bias based on a self-reported survey, admissions committee members at The Ohio State University College of Medicine ( 14 ) had high levels of implicit preference for White versus Black students as measured by the Black-White IAT. Results were presented to committee members with strategies to reduce implicit bias. The following admissions cycle resulted in an increase in underrepresented minority matriculation from 17% to 20%, a change that was not statistically significant.
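As a rough illustration of why a three-percentage-point shift in matriculation can fall short of statistical significance, the sketch below runs a two-proportion test under assumed class sizes. The source does not report the actual denominators, so the counts here are hypothetical.

```python
# Back-of-the-envelope check of a 17% -> 20% change in underrepresented
# matriculation. The class sizes of 200 are assumptions for illustration only.
from statsmodels.stats.proportion import proportions_ztest

counts = [int(0.17 * 200), int(0.20 * 200)]   # underrepresented matriculants per cycle
nobs = [200, 200]                             # assumed class sizes

stat, pvalue = proportions_ztest(counts, nobs)
print(round(pvalue, 2))   # well above 0.05 with classes of this size
```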

Seventy-six percent of studies ( 8 , 13 , 14 , 23 , 28 , 35 – 38 , 48 , 51 , 58 , 59 , 77 , 82 , 94 , 96 , 99 , 109 ) instructed on structural determinants such as structural racism and/or historic oppression of groups so that subjects could explore explicit and implicit biases. All these studies demonstrated an increased awareness of bias, and subjects often voiced a willingness to address their biases. Four studies explored the use of contact with groups such as LGBTQI individuals ( 58 , 59 ) and persons with mental illness ( 27 , 77 ), with positive and negative results, respectively.

In recognition that biases may be immutable in the current health care context but can be managed, educators have used transformative learning theory (TLT) in concert with implicit bias management techniques. TLT transforms the individual’s existing paradigm by disrupting assumptions and then engaging in critical reflection and dialogue to interpret the disruptions ( 68 ). TLT may move learners to an “inclusive, self-reflective and integrative frame of reference” ( 100 , p. 718). This paired approach has had early success. Sherman et al. ( 96 ) engaged both residents and faculty in transformative learning to address issues of race, racism, and Whiteness and created an environment for critical dialogue incorporating practical recommendations for addressing implicit bias in clinical practice. Focus groups 4 months later revealed that subjects noted increased awareness of their biases and sustained commitment to addressing racial bias, to challenging their own clinical decision making, and to engaging leadership in dialogue regarding bias.

Gonzalez et al. ( 38 ) describe implicit bias recognition and management (IBRM), a process that promotes conscious awareness of biases and fosters behavioral changes. IBRM supposes that biases are difficult to reduce and should therefore be managed. IBRM has helped medical students interrupt biases in learning and clinical settings. Wu et al. ( 109 ) paired IAT administration with training to improve skills in bias literacy, emotional regulation, and allyship ( Table 4 ). Trainees practiced these skills in clinical vignettes and improved their confidence in addressing bias in real-world settings. All three studies created a brave space to explore biases and emphasized continued practice and development of skills.

These studies have multiple limitations. They often lacked control groups or relied on simple pre- and postcomparison designs. They had limited longitudinal follow-up and often were not performed in real-world clinical or learning environments. Many studies did not focus on targeted outcomes, and most did not address the full continuum of learners in medical education, including practicing health care providers and leadership. Most interventions were delivered only once, with no opportunity to measure a dose- or time-dependent effect.

5. DISCUSSION

Many of the interventions demonstrated successful promotion of awareness of implicit bias held among subjects as well as an interest in mitigating implicit biases among subjects. No intervention in this review, however, achieved sustained reduction of implicit bias among health care professionals or trainees. In addition, no study demonstrated that an intervention improved clinical outcomes, the learning environment, interprofessional team dynamics, patient care, health disparities, patient satisfaction, or satisfaction of health professionals. Studies were hampered by lack of statistical analysis, lack of control group, limited numbers of participants, findings that are not necessarily generalizable from the classroom or web-based setting to the clinical or real-world setting, and heavy reliance on qualitative assessments or nonvalidated instruments. Future studies should also assess whether regularly timed booster interventions manifest in sustained changes over time and should have longer-term follow-up to assess sustainability of initial gains. Future studies should include educational models that use direct clinical observation or standardized patients. Studies should assess health care trainees’ ability to incorporate skills into patient communication and shared decision making, their improvement of clinical delivery practices, their interactions with colleagues, and their teaching practices.

5.1. Conceptual Model

Based on Jones’s ( 54 ) allegory A Gardener’s Tale , we present a conceptual model of implicit biases of health care providers and the key structural factors affecting these biases ( Figure 2 ). In the vicious cycle of health disparity, students, trainees, and providers receive a constant barrage of messaging that reinforces biases. The soil of their work (practice and learning environments) is laden with structural bias from racialized medicine, a biased learning environment, and poor compositional diversity. Furthermore, these trainees and health care providers are under substantial time pressure and cognitive load. These characteristics of the practice and learning environments may be considered structural determinants of implicit bias.

Figure 2. Interactions between structural determinants and provider implicit bias. The vicious cycle: Structural determinants of implicit bias in the practice environment support biased decision making. Structural determinants of health in the community further impair outcomes in marginalized populations, leading to confirmation of the practitioner’s implicit bias. Health disparities are exacerbated. The virtuous cycle: A favorable practice environment regarding structural determinants of implicit bias supports unbiased clinical decision making. Favorable structural determinants of health in the community further enhance patient outcomes, positively reinforcing unbiased practice. Health disparities are reduced.

Biases are now primed as the clinician moves to provide care to patients (see the left side of Figure 2 ). When caring for marginalized patients, the provider’s bias influences communication with the patient, potentially resulting in suboptimal decision making. The patient may sense the bias, may distrust the provider and system, and may decide to not follow through on treatment plans or may modify them. The patient lives in underresourced and unhealthy spaces that contribute to poor outcomes. The provider notes the poor outcomes and their implicit bias is confirmed. Health care disparities are exacerbated. Further exacerbation of the vicious cycle occurs when this dynamic is accompanied with biases toward students, trainees, and providers from marginalized groups. Individuals from these marginalized groups are less likely to succeed, confirming biases about them and perpetuating poor diversity in the health care workforce. The benefits of diversity to education and patient care are lost.

The right side of Figure 2 depicts the virtuous cycle of health equity. A well-resourced provider learning and working within an environment devoid of racialized medicine and bias and characterized by compositional diversity is less likely to display biases against the patient. Compositional diversity also increases the likelihood that the provider shares lived experiences with the patient. The patient notes the absence of provider bias, develops a trusting relationship, adheres to the treatment plan in a well-resourced environment, and returns with improved health outcomes. The patient’s outcome confirms the provider’s more favorable bias. Health care disparities are reduced.

This conceptual model highlights two important dynamics in the perpetuation of implicit bias and its impact on care. First, structural determinants in the health care system and surrounding community contribute to the development of implicit bias toward marginalized patient populations and then reinforce that implicit bias through generation of poorer patient outcomes. Second, interruption of this cycle is possible only through an overall shift toward favorable structural influences on implicit bias. Discrete, time-limited training as the sole intervention to reduce implicit bias is unlikely to result in sustained change; health care providers return to a practice or learning environment that is often replete with structural determinants and patient outcomes that reinforce implicit bias. To avoid the ongoing creation and perpetuation of racist structures in society, systems, and organizations, it is crucial to recognize that these dynamics may enhance the implicit bias of medical leaders and policy makers as well.

5.2. Taking Action

To enable provider-level bias interventions to succeed in improving health outcomes, multiple other concurrent approaches should address structural factors inside and outside the health care system that influence these biases ( 80 ).

Structural inequities outside the health care system include poor access to high-quality health care, racialized violence, the carceral state, crowded housing, healthy food scarcity, lack of access to green spaces, environmental toxins, and poorly protected workspaces, among other issues related to geography and place ( 19 , 103 ).

Structural inequities inside the health care system that prime bias include the work and learning environments of students, trainees, and providers ( 104 ). It will be important to address these structural drivers of bias, including time pressures, cognitive load, and the practice of racialized medicine. Racism, sex and gender discrimination, and other forms of discrimination must be rooted out, as they prevent marginalized trainees and faculty from thriving, create stereotype threat for the marginalized, and confirm bias for the nonmarginalized. Bioethical principles of fairness, distributive justice, and reciprocity should be core for public health officials and health care providers, and practitioner and provider trainings in these areas can raise awareness. For example, to address health inequities laid bare by COVID-19, Peek et al. ( 79 ) recommend a multifactorial approach that acknowledges the systemic racism of the health care system and other societal structures as well as the biases of providers ( 67 ).

Addressing compositional diversity in health care is another avenue for changing the structures that influence implicit and explicit biases and for eliminating health care disparities. Minority health professionals are underrepresented in the workforce and among health professions faculty ( 60 ). Only 6.2% of medical students identify as Hispanic or Latinx, and only 8.4% as Black or African American ( 1 ). Gender parity among medical school students has been achieved. However, women are underrepresented at the faculty instructor level, with substantially less representation at the professor level, and are also underrepresented in hospital leadership, with even starker inequities for female racial and ethnic minorities ( 33 , 88 ). Gender inequalities in salaries have been well documented ( 12 , 62 , 71 ). In academic medicine, Black male faculty are offered lower rates of compensation than their White counterparts and are less likely to be awarded research funding from the National Institutes of Health ( 34 ). Similarly, in 2016, graduate student enrollment in the Association of Schools and Programs of Public Health demonstrated a ≤5% increase over a 20-year period among Asian, Black, Hispanic, and Native American students; only 11.1% of students were Black and 12% were Hispanic. Black, Hispanic, and Native American representation among tenured public health faculty increased <3% during this same 20-year period ( 39 ).

6. CONCLUSION

TLT, IBRM, and a skills-based approach offer promise for future interventions in implicit bias management. It is also encouraging that discussions around disparities and inequities have moved from race to racism and have focused on the professional responsibility of providers to root out inequities and manage biases. The extant literature regarding the use of provider-level implicit bias interventions suggests that these interventions can play an important role in concert with other interventions that more broadly address bias and discrimination inside and outside the health care system. Evidence supports the use of provider-level interventions in immediate-impact activities such as decision making on search committees or admissions committees and raising critical awareness of the bioethical principles of fairness, distributive justice, and reciprocity. However, provider-level implicit bias interventions alone have not improved health outcomes. Thus, provider-level implicit bias interventions should be accompanied by interventions that systemically change structures inside and outside the health care system that influence biases and perpetuate health inequities.

ACKNOWLEDGMENTS

The authors extend their heartfelt thanks to Debra A. Werner, the University of Chicago’s Librarian for Science Instruction & Outreach and Biomedical Reference Librarian, for her patient guidance and assistance with the systematic literature review, and Morgan Ealey, Administrative Manager, Section of General Internal Medicine, who helped format the manuscript.

DISCLOSURE STATEMENT

M.E.P. and M.H.C. were supported in part by Bridging the Gap: Reducing Disparities in Diabetes Care National Program Office, funded by the Merck Foundation, and the Chicago Center for Diabetes Translation Research, funded by the National Institute of Diabetes and Digestive and Kidney Diseases (P30 DK092949). M.H.C. was also supported in part by Advancing Health Equity: Leading Care, Payment, and Systems Transformation, a program funded by the Robert Wood Johnson Foundation. M.H.C. is a member of the Blue Cross Blue Shield Health Equity Strategy advisory panel, Bristol Myers Squibb Company Health Equity Initiative advisory board, and The Joint Commission and Kaiser Permanente Bernard J. Tyson National Award for Excellence in Pursuit of Healthcare Equity review panel. The other authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

Blinding: an essential component in decreasing risk of bias in experimental designs

Dorothy Forbes, Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada (dorothy.forbes{at}ualberta.ca)
Volume 16, Issue 3. https://doi.org/10.1136/eb-2013-101382

What is blinding?

Blinding (or masking) is the process used in experimental research by which study participants, persons caring for the participants, persons providing the intervention, data collectors and data analysts are kept unaware of group assignment (control vs intervention). Blinding aims to reduce the risk of bias that can be caused by an awareness of group assignment. With blinding, outcomes can be attributed to the intervention itself and not influenced by behaviour or assessment of outcomes that can result purely from knowledge of group allocation.

Why incorporate blinding?

Performance bias refers to systematic differences between the treatment and control groups resulting from care that was provided, or exposure to factors other than the interventions of interest. After enrolment into the study, blinding of participants and personnel may reduce the risk that knowledge of which intervention was received affects outcomes. If blinding is not incorporated or is unsuccessful, participants may respond better if they know they have received a promising new treatment. On the other hand, if participants are aware that they are not receiving an active treatment they may be less likely to comply with the study protocol, more likely to seek additional treatment and more likely to leave the study without providing outcome data. 5 The healthcare providers who are blinded to participant allocation are much less likely to transfer their values to participants or to provide differential treatment to the active and placebo groups. 5 However, blinding may not be possible in some studies where the intervention is obvious to the participants and/or persons administering the intervention (eg, an exercise intervention). Such studies can take other measures to reduce the risk of bias, such as treating participants according to a strict protocol to reduce the risk of differential behaviours by persons administering the intervention.

Blinding of outcome assessors is equally important to reduce the introduction of bias into the assessments and should be attempted whenever possible. 5 Outcome assessments may be made by the participants themselves, by their healthcare providers, or by independent assessors. Blinding of the statistical analysts is achievable by simply labelling the participants' data with non-identifying codes. 5
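The recoding step described above is straightforward in practice. The following is a minimal sketch, assuming a simple tabular dataset with one outcome per participant; the column names, file name, and two-arm design are illustrative rather than taken from the source.

```python
# Minimal sketch of blinding the statistical analyst: a trial coordinator
# replaces the true arm labels with neutral codes before the dataset is
# released. All names and values here are illustrative.
import random
import pandas as pd

trial = pd.DataFrame({
    "participant_id": [101, 102, 103, 104],
    "arm": ["intervention", "control", "control", "intervention"],
    "outcome": [4.2, 3.1, 2.8, 4.9],
})

# Randomly map the real arms to neutral codes ("A"/"B"); the key is kept by
# the coordinator and withheld from the analyst until the analysis is locked.
codes = ["A", "B"]
random.shuffle(codes)
key = dict(zip(["intervention", "control"], codes))

blinded = trial.assign(group=trial["arm"].map(key)).drop(columns="arm")
blinded.to_csv("blinded_dataset.csv", index=False)  # file sent to the analyst
# `key` is stored separately and revealed only after the pre-specified
# analysis has been completed.
```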

How to implement blinding?

Blinding is not a simple procedure. The researchers often need to engage a variety of approaches to enhance blinding. Boutron et al 6 conducted a systematic review of methods used in pharmacological RCTs to establish blinding of patients and/or healthcare providers. These included providing treatments in identical form, specific methods to mask characteristics of the treatments (eg, added flavour or colour), or use of double dummy procedures and even simulation of an injection.

Methods to avoid unblinding involved use of active placebo, centralised assessment of side effects, and patients informed only in part about the potential side effects of each treatment. Some of the methods used for blinding outcome assessors included centralised assessment of complementary investigations, clinical examination that involved the use of video, audiotape or photography, and adjudication of clinical events. Clearly there are ethical considerations to blinding. All blinding approaches should be explained as part of the method and receive ethical approval from research ethics boards.

How to assess if blinding has been successful?

An attempt to blind participants and personnel does not always ensure successful blinding in practice. For example, in many blinded drug trials the side effects of the drugs can reveal group allocation, unless the study compares two rather similar interventions (eg, drugs with similar side effects) or uses an active placebo. 6 It has been suggested that it would be useful to ask trial participants at the end of the trial to guess which treatment they have received, 7 , 8 and some reviews of such reports have been published. 7 , 9 Evidence of correct guesses exceeding 50% would suggest that blinding may have been broken. However, responses may simply reflect the patients' experiences in the trial: a good outcome will tend to be more often attributed to an active treatment, and a poor outcome to a placebo. 10
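One simple way to formalise the "correct guesses exceeding 50%" check is a one-sided binomial test, sketched below with hypothetical counts. As noted above, an apparent excess of correct guesses can also reflect participants' outcomes rather than genuinely broken blinding, so the result should be interpreted cautiously.

```python
# Rough sketch of testing end-of-trial allocation guesses against chance.
# The counts are hypothetical.
from scipy.stats import binomtest

correct_guesses = 68   # participants who guessed their allocation correctly
total_guesses = 100

result = binomtest(correct_guesses, total_guesses, p=0.5, alternative="greater")
print(result.pvalue)   # a small p-value suggests guessing above chance,
                       # ie, blinding may have been compromised
```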

Risk of bias may be high for some outcomes and low for others. For example, knowledge of the assigned intervention may impact on behavioural outcomes (eg, number of visits to their physicians), while not impacting on physiological outcomes or mortality. Thus, assessments of risk of bias resulting from lack of blinding may need to be made separately for different outcomes. Rather than assessing risk of bias for each outcome separately, it is often convenient to group outcomes with similar risks of bias. For example, there may be a common assessment for all subjective outcomes (eg, quality of life) that is different from objective outcomes (eg, blood work). 11

In summary, when considering the effectiveness of blinding in reducing the risk of bias, it is important to consider specifically:

Were the participants and study personnel blinded or not blinded?

Who assessed the outcomes and were they blinded or not blinded?

What was the risk of bias in the outcome assessment considering the subjectivity or objectivity of an outcome? 11


Competing interests None


54 How to Limit Bias in Experimental Research

10.1055/b-0035-122054
Paul J. Jenkins

All scientific studies are subject to experimental error, and it is the duty of the investigator to eliminate it where possible, or reduce its impact if it cannot be completely removed. Error may jeopardize the validity of research findings, which may result in useless or harmful treatments being recommended and in a waste of limited research resources. Bias is a specific type of error that results in a consistently false result through a systematic flaw in the experiment's methodology. It has been increasingly recognized that clinical and basic scientific research may be subject to significant biases that jeopardize their findings. 1 , 2

In order to understand and reduce potential sources of error, an investigator requires a sound understanding of the experimental process ( Fig. 54.1 ). Some study types are inherently more at risk of error than others. Inclusion and exclusion criteria are selected, or a suitable animal model, organ, tissue, cell line, or other biological material is chosen. An outcome measure is selected, and comparison of this outcome measure is made between study groups. The methodology of group allocation is critical to performing a sound experiment. The study results are analyzed, written up, and reported via posters, conference presentations, and papers. These are then disseminated to other researchers via journals and research databases. These results may then be further assimilated by meta-analysis.

Fig. 54.1 Experiment process and sources of bias.

54.1 New Techniques in Musculoskeletal Research

Jargon Simplified: Finite Element Analysis
Finite element analysis is a technique that is used to examine the internal stresses, strains, and deformation of materials under loads. It can be applied to simple homogeneous materials or complex heterogeneous biological tissues. Each structure is composed of a multitude of smaller elements that may be rectangles, cubes, or tetrahedrons. They are linked by "nodes" at their edges. The properties of each element and their influence on their neighbors are calculated for the overall structure. This requires intensive computational power for complex materials. Specific biological tissues can be modeled through the creation of a "mesh" from three-dimensional imaging such as computed tomography. The results predicted by the models can be compared to in vitro testing of the tissue under controlled conditions as part of the validation process.

Musculoskeletal research has been enhanced by the development of new techniques to explore biochemical processes, genes, and proteins. New imaging techniques have allowed more detailed visualization of cellular and extracellular structure. These new techniques are subject to experimental error, and the investigator should recognize the potential sources of error. Biomechanical research also regularly investigates the material properties of tissues and implants. Design and testing of novel implants require techniques to evaluate strength, fatigability, and wear in both in vitro and in vivo scenarios. Finite element analysis, while offering myriad new possibilities to understand loading and stress distribution in biological tissues and implants, has specific sources of error and bias that must be recognized. Techniques in molecular biology and biomechanics are constantly evolving, and it is impossible to provide complete coverage of each of them, along with specific biases.
The aim of this chapter is to discuss the types of bias that may occur in research, with particular reference to musculoskeletal research. It will also discuss techniques to reduce these biases through study design and risk assessment. Using this knowledge, researchers will be able to reduce the influence of bias through careful planning of their research and experiments. They will also be able to analyze new techniques for potential sources of bias.

54.1.1 Key Concepts: Types of Experimental Error

Random error occurs when there is chance variation between individuals or specimens. Biological processes may be influenced by myriad factors, and no two specimens or sets of environmental conditions are identical. These errors can only be reduced by the ubiquitous techniques of increasing repetition, taking multiple samples, and averaging results. Statistical techniques exist to decide whether a difference between groups is a result of chance or represents a real effect. Random errors are generally reduced by increasing the sample size. Sample sizes are limited, however, by experimental design and by technical, time, and funding considerations. A researcher can predict the sample size required by performing a power calculation. A power calculation generally requires an estimate of the dispersion of the characteristic in the sample (such as the standard deviation) and the minimal detectable difference expected to be measured. In situations where these cannot be established from existing research, a pilot study may be required.

Systematic error is otherwise known as bias. 3 While random errors can be reduced by repeated experiments, biased studies can never be improved by repetition or sample size. Bias is the tendency of an experiment to produce a finding that is not consistent with the actual situation, through the methodology of performing the experiment. There are five main types of bias, and they are influential at different stages of the research path.

54.1.2 Key Concepts: Main Types of Bias

  • Selection bias: systematic error introduced in choosing the experimental population and dividing it into groups
  • Performance bias: systematic error introduced while carrying out the experiment
  • Measurement bias: systematic error introduced while assessing the outcome
  • Analytical and interpretation bias: systematic error introduced during the analysis of data and interpretation of results
  • Publication bias: systematic error introduced during the submission and presentation of the findings

54.2 Types of Bias

54.2.1 Selection Bias

Selection bias occurs where experimental subjects or specimens are divided into different intervention groups. The optimum situation would be for each group to be completely identical apart from the characteristic being deliberately altered between groups. This could refer to animal model age and gender or to cell line characteristics. Confounding is a particular result of selection bias, where a second, unmeasured characteristic is linked with group selection and may also have an association with the outcome measure. In animal models that require the induction of a disease state or the performance of a surgical procedure, knowledge of the subsequent study group may influence the researcher performing this step, thereby introducing bias. 3 An experiment may also be planned on tissue obtained from live patients (such as ligament, tendon, cartilage, or bone). In such cases, other forms of selection bias may occur, similar to those seen in other clinical studies.
Sample bias is a form of selection bias that results from the sample population having different characteristics from the target population, such as age, gender, or comorbidity. Sample bias is one of the most common biases in basic science research, where the tissue or model is not adequately representative of the in vivo process and the generalizability of results is limited. Referral bias may occur if only a proportion of those in the target population who meet the inclusion criteria are considered for the research study. Certain groups of patients may be more likely to participate in research than others, and this may introduce participant bias, where the results will tend to reflect the situation in those more likely to participate in research. Some of these effects may be negligible, but often they are unknown and inestimable. Selection bias can also result from loss to follow-up or post hoc exclusion of subjects. In basic science research, this may occur where a technique does not work as planned in a particular group. If these results are not included, the findings may be biased toward or away from the group in which the technique worked.

Reducing Selection Bias

Control of selection bias is easier in prospective experiments, where the target population or specimen specification can be tightly controlled to ensure uniformity between groups. In retrospective studies, or studies where experimentation is planned on tissue obtained from patients during a study, selection bias is primarily controlled through rigorous study design and participant selection. In animal studies, the specimens should be of the same age and gender at the time of the study. If the disease process is to be induced prior to intervention, this should be performed without knowledge of the subsequent intervention group.

Randomization is one of the most powerful methods of reducing selection bias. It can reduce the effect of unequal distribution of known and unknown characteristics between groups; as such, it is also effective in reducing random error. Practically, randomization can be performed using many techniques, including sealed envelopes and computerized randomization services. The researcher should be unaware of the randomization sequence, and randomization should occur at the time of intervention. Quasirandomization techniques based on study sequence or day of the week (among others) offer less protection against bias. Block randomization can be used to ensure that groups remain balanced at the end of a predetermined number of participants (e.g., six or eight) to assist in resource utilization; this may be important if a procedure- or time-intensive assay is required by the protocol. Stratification and minimization are techniques used in the randomization procedure to account for known confounding characteristics, where it is extremely desirable to ensure equal distribution between groups. Decisions regarding stratification and minimization should be taken during the development of the protocol if particularly powerful confounders are recognized. The advice of a biostatistician is recommended for these more advanced techniques of randomization, because they can have significant implications for power calculations (since studies must be powered to demonstrate differences between subgroups).

Jargon Simplified: Allocation Concealment

Allocation concealment is the practice of protecting the randomization sequence from prior knowledge by the participants or researchers.
Concealment could be broken, for example, by knowledge of the random number tables used to generate the sealed envelopes. For this reason, external randomization services (now readily accessible over the Internet) and the sequential numbering of sealed envelopes are used to prevent tampering with the sequence and thereby prevent selection bias.
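As an illustration of the randomization step described above, here is a minimal, hypothetical sketch of block randomization with a concealed allocation list. It is not taken from the chapter; the block size, group labels, and seed are illustrative assumptions, and in practice the generated list would be held by an independent party (for example, as sequentially numbered sealed envelopes).

```python
# Hypothetical block randomization sketch (not from the chapter).
# Within each block of a predetermined, even size, half the slots go to each
# group, so group sizes stay balanced throughout recruitment.
import random

def block_randomize(n_participants: int, block_size: int = 6, seed: int = 2024):
    """Return a concealed allocation list for two groups, 'A' and 'B'."""
    assert block_size % 2 == 0, "block size must be even for two equal groups"
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_participants:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)                # random order within the block
        allocations.extend(block)
    return allocations[:n_participants]

if __name__ == "__main__":
    sequence = block_randomize(24)
    print(sequence)                                      # e.g. ['B', 'A', 'A', ...]
    print(sequence.count("A"), sequence.count("B"))      # balanced group sizes
```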

How to Eliminate Bias in Qualitative Research

Eliminating bias in a research project can be difficult.

Steps & Procedures for Conducting Scientific Research

Qualitative research is a type of scientific investigation that aims to provide answers to a question without bias. It uses predetermined procedures such as interviewing participants to collect information and produce findings. Biases occur naturally in the design of your research, but you can minimize their impact by recognizing and dealing with them. An impartial qualitative research project respects the dignity of the research participants, observes fundamental principles of ethics and takes all of the variables into account.

Avoid design problems by understanding the limitations of the sample group. For example, if you are researching the health benefits of a certain food, be aware if only females or people over a particular age are involved. Bias can occur when certain groups are left out. Account for any unavoidable omission bias by changing the experimental design.

Ensure that the research participants are independent and treated with respect so that they are protected from exploitation. This ensures that people are not selected based on a desire to prove a specific research objective. Avoid becoming focused on one viewpoint when observing participants as this endangers the impartiality of the research.

Allow the research participants enough time to complete questionnaires. Procedural bias can occur if you put too much pressure on them. For example, employees who are asked to complete a survey during a coffee break are more likely to skim through the questions without reading them properly.

Be aware of errors in data collection and measuring processes. For example, when collecting information on prejudice against people of other races, know that most people are reluctant to give answers in an interview because they fear being judged or appearing racist. Researchers often deal with this measurement bias by using numerous interviews and anonymous questionnaires, recognizing that people will tell the interviewer what they think the interviewer wants to hear instead of the truth.

Review all the variables arising from the experiment to ensure that there are no experimental errors. False positives and negatives will create biased results.

Ensure that the results of the research are accurately recorded in literature to avoid reporting bias. Show that you understand that certain biases exist and that you have made every effort to consider this in the analysis and statistics.

Seek training and certification in research ethics before beginning the preliminary work and data collection of qualitative research.

Be wary of bias in the findings of research published on the Internet. Some research companies suppress unfavorable studies and promote only those with more positive results.

What is Experimenter Bias and How to Avoid It?

Experimenter bias is a research phenomenon in which a researcher's or experimenter's conclusions are swayed by their own expectations.

Robert Rosenthal, a renowned psychologist, was among the first to describe experimenter bias in terms of "experimenter expectancies." The phenomenon had not attracted much attention until the early 1960s.

What is Experimenter Bias?

The term experimenter bias refers to the researcher's influence on the outcome of their own research. When researchers choose a topic, they usually have a probable outcome already predicted in their minds; in psychology this is termed the "observer-expectancy effect." Because the outcome is predicted in advance, the research methodology, the way the outcome is analyzed, and even the way it is interpreted can all be influenced. Research taken up with a probable result already in mind will not produce a precise result, which makes it unreliable: the judgment or conclusion deviates from what the actual outcome should be.

Experimenter bias can occur in any facet of research. A researcher's bias can influence the literature review, the study sample, the method of data analysis, and even the way the outcome of the research is presented.

Qualitative Research Bias

Qualitative research bias is usually found in social science research, because social science is not as exact as physical science: it studies behavioral patterns, and fully measuring or quantifying behavior is not possible. The way qualitative methods are applied also differs from researcher to researcher, so the results of a social study rest on the researcher's interpretation, and experimenter bias in social and behavioral studies is sometimes ineluctable. In economics, for example, a Marxist's view of capitalism and a Keynesian's perspective on a mixed economy will produce two biased interpretations of the outcome of the same research. Outcomes in the physical sciences can also be biased, for instance when researchers round figures in a way that makes the conclusion inaccurate.

Quantitative Research Bias

Quantitative bias is associated with choosing the wrong sample or the wrong method of analysis. A wrong sample is a biased sample. Suppose you are researching a company's work policies, but for the review you survey only the women at the company and not the men; the sample is biased because it does not reflect the men's opinions. A sample that is too small can likewise distort the outcome, and so can choosing a wrong or inaccurate method of data analysis.

How to Avoid Experimenter Bias?

Experimenter bias stems from the human difficulty of remaining objective and the tendency to drift toward subjectivity. The following techniques help to counteract it.

Double-Blind Design

This is the most common and effective technique used by researchers. In a double-blind design, the researchers who interact with participants and record the outcomes are kept unaware of which condition each participant is in, and the participants are likewise prevented from learning the researcher's perspective or expectations about the study.

Mechanize Procedures

The human brain is remarkable, but it is also where biases creep in, so it is best to mechanize your research procedures wherever possible. Use a computer to store and manage data, and use analytical software to analyze it. A computer will not give you a biased answer; it will simply compute whatever is put before it. A minimal sketch of such a mechanized analysis appears below.
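The sketch below is a hypothetical illustration of this advice, not part of the original article: the data values are invented, and the analysis is a standard Welch's t-test run through SciPy, so the same scripted steps produce the same answer regardless of what the analyst hopes to find.

```python
# Hypothetical mechanized analysis: the numbers are illustrative, and the
# scripted test runs identically no matter what result the analyst expects.
from scipy import stats

# Scores entered once and never adjusted by hand afterwards.
treatment_scores = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5]
control_scores = [13.0, 13.6, 12.9, 14.1, 13.3, 13.8]

# Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(treatment_scores, control_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```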

Research Protocols

Students and researchers should adhere to a recommended research protocol format. This makes the research systematic and reduces the scope for bias: the researcher must follow the protocol step by step, which prevents them from jumping straight to a conclusion or to a convenient methodology.

Investigator Intervention

A supervising investigator should plan the research rather than leaving every decision in the hands of the experimenter. The experimenter should be given clear guidance on what to do and what not to do.

Consider More Aspects

When drawing conclusions, consider not only the cause-and-effect relationship between the independent and dependent variables but also any effects on other, indirectly related variables.

Experimenters are bound to steer their research according to their own interpretation and understanding of the subject. When research investigators mentor the researchers and track their decisions, however, the intensity of this prejudice can be reduced.


Sampling Bias: Types, Examples & How to Avoid It


Sampling bias occurs when a sample does not accurately represent the population being studied. This can happen when there are systematic errors in the sampling process, leading to over-representation or under-representation of certain groups within the sample.

Sampling bias results in biased samples of a population where all individuals were not equally likely to have been selected and thus do not accurately represent the entire group.

Sampling bias compromises the external validity of findings by failing to accurately represent the population, restricting the generalization of results only to groups that share characteristics with the sample.

[Figure: a sample drawn from a target population]

In medical fields, sampling bias is known as ascertainment bias, where one category of participants is over-represented in the sample.

Sampling bias is problematic because it leaves out important research data, threatening external validity. The results from research completed with a sampling bias are misleading and exclude valuable data.

This limits the generalizability of your findings because findings from biased samples can only be generalized to populations that share characteristics with the sample. Thus the results from the research cannot be used to express the ideas and thoughts of the majority. 

When there is sampling bias in your study, differences between the samples from a population and the entire population they represent are not due to chance but rather due to this bias.

Correcting or reducing sampling bias is important during the research because the population will not be accurately represented if the sample bias is not addressed. 

It is important to note that sampling bias occurs during data collection and refers to the method of sampling, not the sample itself. Additionally, sampling bias often happens without the researcher’s knowledge.

Imagine you want to study the prevalence of depression amongst undergraduate students at your university. You send out an email to the undergraduate student body asking for volunteers to participate in your study.  

This method will lead to sampling bias because only the people who are open to talking about their depression will sign up to participate.

This is an example of voluntary response bias because only those individuals who are willing to talk about their experiences with depression will agree to take part in a study, making the participants a non-representative sample.

Undercoverage Bias

Undercoverage bias occurs when some members of the population are inadequately represented in the sample.

For example, administering a survey online will exclude groups with limited internet access, such as the elderly and those in lower-income households.

Voluntary Response Bias / Self-Selection Bias

Self-selection bias is a type of bias that occurs when participants can choose whether or not to participate in the project.

Bias arises because people with specific characteristics might be more likely to agree to participate in a study than others, making the participants a non-representative sample.

For example, people with strong opinions or substantial knowledge about a specific topic may be more willing to spend time answering a survey than those without. 

Survivorship Bias 

Survivorship bias refers to when researchers focus on individuals, groups, or observations that have passed some sort of selection process while ignoring those who did not.

In other words, only “surviving” subjects are selected. For example, in finance, failed companies tend to be excluded from performance studies because they no longer exist.

This causes the results to skew higher because only companies that were successful enough to survive are included.

Non-Response Bias

Non-response bias is a type of bias that arises when people who refuse to participate or drop out of a study systematically differ from those who take part.

For example, if conducting a study on the prevalence of depression in a community, your results may be an underestimation if those with depression are less likely to participate than those without depression.

Recall Bias

Recall bias occurs when some members of your sample cannot remember important details accurately. As a result, they might provide incomplete or incorrect information that can distort your research findings.

This type of bias tends to affect retrospective surveys that rely on self-reported data. 

Exclusion Bias 

This bias results from intentionally excluding a particular group from the sample. Exclusion bias is closely related to non-response bias. 

Observer Bias

Observer bias refers to the tendency of observers not to see what is there, but instead to see what they expect or want to see.

This bias can result in an overestimation or underestimation of what is true and accurate, which compromises the validity of your research findings.

For example, researchers might unintentionally influence participants during interviews by focusing on specific statistics that tend to support the hypothesis instead of those that do not.

A common cause of sampling bias lies in the study's design or the data collection process, as researchers may favor or disfavor collecting data from certain individuals or under certain conditions.

Sampling bias also tends to arise when researchers adopt sampling strategies based on judgment or convenience. 

This type of bias can occur in both probability and non-probability sampling. 

In probability sampling, every member of the population has an equal chance of being selected (i.e., using a random number generator to select a random sample from a population). While probability sampling tends to reduce the risk of sampling bias, it typically does not eliminate it completely.

Extracting random samples typically requires a sampling frame, or a list of units of the whole population from which the sample is drawn. However, using a sampling frame does not necessarily prevent sampling bias. If your sampling frame does not match the population, this can result in a biased sample.

This can happen when a researcher fails to correctly determine the target population or uses outdated and incomplete information, thus excluding sections of the target population.

Or, even when the sampling frame is selected properly, sampling bias can arise from non-responsive sampling units (i.e., if certain classes of subjects are more likely to refuse to participate).

Mismatches between the sampling frame and the target population, as well as non-responses, can result in a biased sample.  
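As a minimal illustration of drawing a probability sample from a sampling frame, here is a hypothetical Python sketch; the frame of student IDs and the sample size are invented for the example. Note that any unit missing from the frame has zero chance of selection, which is exactly how a frame mismatch produces a biased sample.

```python
# Hypothetical sampling frame of student IDs; any student missing from this
# list has no chance of being selected, however random the draw is.
import random

sampling_frame = [f"student_{i:04d}" for i in range(1, 5001)]

random.seed(42)                                   # reproducible draw
sample = random.sample(sampling_frame, k=200)     # simple random sample, no replacement
print(sample[:5])
```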

In non-probability sampling, samples are selected based on non-random criteria, such as with convenience sampling where participants are selected based on accessibility or availability.

These sampling techniques often result in biased samples because some population members are more or less likely to be included than others.

How to Avoid Sampling Bias

  • Use random or stratified sampling → Stratified random sampling will help ensure you get a representative research sample and reduce the interference of irrelevant variables in your systematic investigation (a minimal sketch follows this list).
  • Avoid convenience sampling → Rather than collecting data from only easily accessible or available participants, you should gather data from the different subgroups that make up your population of interest. 
  • Clearly define a target population and a sampling frame → Matching the sampling frame to the target population as much as possible will reduce the risk of sampling bias. 
  • Follow up on non-responders → When people drop out or fail to respond to your survey, do not ignore them, but rather follow up to determine why they are unresponsive and see if you can garner a response.  Additionally, you should keep close tabs on your research participants, and follow up with them frequently to reduce attrition.
  • Oversampling → Oversampling can be used to avoid sampling bias in cases where members of the defined population are underrepresented.
  • Aim for a large research sample →  The larger your sample population, the more likely you are to represent all subgroups from your population of interest.  
  • Set up quotas for each identified demographic → If you think participant gender, age, ethnicity or some other demographic characteristic is a potential source of bias within your study, quotas will allow you to evenly sample people from different demographic groups within the study.
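The following is a minimal, hypothetical sketch of proportionate stratified random sampling; the population, the strata (year of study), and the sample size are invented for illustration. Sampling each stratum in proportion to its share of the population keeps every subgroup represented.

```python
# Hypothetical proportionate stratified sampling: each stratum contributes
# to the sample in proportion to its share of the population.
import random
from collections import defaultdict

random.seed(0)

# Invented population: (person_id, year_of_study).
population = [(i, random.choice(["first", "second", "third", "fourth"]))
              for i in range(10_000)]

strata = defaultdict(list)
for person_id, year in population:
    strata[year].append(person_id)

total_sample_size = 400
sample = []
for year, members in strata.items():
    n_stratum = round(total_sample_size * len(members) / len(population))
    sample.extend(random.sample(members, n_stratum))

print(len(sample), {year: len(members) for year, members in strata.items()})
```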

What is the difference between sampling bias and sampling error?

Sampling error is a statistical error that occurs because a sample, rather than the whole population, is studied, so the sample statistic differs from the true population value by chance. Sampling bias adds a systematic, non-random component on top of this chance error.

What is the difference between sampling bias and response bias?

Sampling bias occurs when some members of a population are systematically more likely to be selected in a sample than others and thus the sample does not accurately represent the entire group.

Response bias is a general term that refers to a wide range of conditions or factors that can lead participants to respond inaccurately or falsely to questions.  

For example, there could be something about how the actual survey questionnaire is constructed that encourages a certain type of answer, leading to measurement error.

Which type of sampling is most at risk for sampling bias?

Non-probability sampling, specifically convenience sampling, is most at risk for sampling bias because with this type of sampling, some members of the population are more likely to be included than others.

Does sampling bias affect reliability?

Yes, sampling bias distorts the research findings and leads to unreliable outcomes. It also is a threat to external validity because the results from a biased sample may not generalize to the population.

Why is it important to avoid sampling bias in research?

It is important to avoid sampling bias in research because otherwise the population of interest will not be accurately represented. If sampling bias is not addressed, your research loses its credibility.

Is probability sampling biased?

While probability sampling can significantly reduce sampling bias by  giving every member of the population an equal chance of being included in the research, this method can still result in a biased sample if your sampling frame does not match the population of interest.

Can sampling error be calculated?

Yes. Sampling error is calculated by dividing the standard deviation of the population by the square root of the sample size, and then multiplying the result by the critical z-value corresponding to the chosen confidence level (for example, 1.96 for 95% confidence).

Here’s the formula for calculating sampling error:

Sampling error = z* × (standard deviation of population / √sample size), where z* is the critical z-value for the chosen confidence level.
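As a quick worked example with illustrative numbers: at a 95% confidence level (z* ≈ 1.96), with a population standard deviation of 15 and a sample of 100, the sampling error is about 2.94.

```python
# Worked example of the sampling-error formula above (illustrative numbers).
import math

z_star = 1.96        # critical z-value for a 95% confidence level
sigma = 15.0         # population standard deviation (assumed known)
n = 100              # sample size

sampling_error = z_star * sigma / math.sqrt(n)
print(f"sampling error ≈ {sampling_error:.2f}")   # ≈ 2.94
```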

Further Reading

Hamill, R., Wilson, T. D., & Nisbett, R. E. (1980). Insensitivity to sample bias: Generalizing from atypical cases. Journal of Personality and Social Psychology, 39(4), 578.

Nielsen, M., Haun, D., Kärtner, J., & Legare, C. H. (2017). The persistent sampling bias in developmental psychology: A call to action. Journal of Experimental Child Psychology, 162, 31-38.

Bias in Experiments

We've all experienced some form of bias in one way or another. You may have seen it happen to others, experienced it yourself, or even participated in it. Bias here means favoring one thing over another even when the thing being favored does not deserve it.



Aside from our everyday lives, bias also occurs during experiments and research. In this article, you will learn about the sources, types and examples of bias in experiments.

Before going in-depth, let's see what bias in experiments means.

Bias in experiments refers to a known or unknown influence in the experimental process, data or results.

Bias can come from anywhere: from the scientist conducting the experiment, from the participants, or from the way the experiment is conducted. Before we go further, let's learn about something called the placebo effect.

Placebo Effect and Blinding

The placebo effect is used all the time, especially in the medical sector.

A placebo is a medicine or procedure that has no active substance and no real effect.

It involves receiving a treatment that causes improvement (or possibly side effects) even when the treatment is fake. Placebos are used to test the effectiveness of a treatment: if the real treatment performs much better than the placebo, you know it really works. The participants receiving the placebo should think they are getting the real thing; otherwise, the effect may not be felt.

In other words, the participants must be blind to which type of treatment they are getting, and something has to be done to make sure of it.

Blinding means to keep information from someone about the type of treatment they are getting.

It is possible for the person administering the treatment to subconsciously give cues that make the patient or participant suspect something. For this reason, both the participant and the person administering the treatment should be unaware of whether it is a placebo or not. This is called double blinding. When the patient or participant is the only one unaware of the type of treatment received, it is called single blinding.
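Here is a minimal, hypothetical sketch of how double-blind allocation can be set up in practice; the participant IDs, treatment codes, and seed are invented. Treatments are hidden behind neutral codes, and only a sealed key held by a third party maps the codes back to "treatment" and "placebo" at analysis time.

```python
# Hypothetical double-blind allocation: everyone involved in the study sees
# only the neutral codes "A" and "B"; the key below stays sealed until unblinding.
import random

participants = [f"P{i:02d}" for i in range(1, 13)]

sealed_key = {"A": "treatment", "B": "placebo"}   # held by a third party

random.seed(7)
codes = ["A"] * 6 + ["B"] * 6                     # balanced groups
random.shuffle(codes)
assignments = dict(zip(participants, codes))

print(assignments)                                # codes only, no treatment names
```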

When the placebo effect works, it doesn't mean that the illness was false. What has happened is that the person's mind and body relax in the knowledge that some kind of medication is being taken. Levels of symptom-causing hormones may fall as a result, and the body begins to behave the way it would without the illness.

That is why experiments use a control group.

A control group is a group that does not receive any treatment during an experiment.

With a placebo, one group receives the treatment while the other receives an inactive treatment; with a control group, one group receives the treatment and the other receives nothing at all. Let's take a look at the sources of bias in experiments.

Sources of Bias in Experiments

As earlier stated, you have bias in experiments when the experimental process is knowingly or unknowingly influenced, affecting the outcome of the experiment. Bias can come from different sources. It can come from the scientist, the participants of the experiment or the experimental environment.

Below are some sources of bias in experiments.

The method of data collection and the source of the data can lead to bias in experiments. To learn about the methods of data collection, see the article on Methods of Data Collection .

Not considering all possible outcomes can lead to bias. Even though it is not really possible to consider every outcome, scientists should make an effort to perform additional experiments to control for any new source of bias they find.

  • Unknown changes in the experimental environment can lead to bias.
  • False behavior and response from the participants can lead to bias.

Let's now see some types of bias in experiments.

Types of Bias in Experiments

The following are some types of bias.

  • Participant or Selection Bias
  • Publication Bias
  • Confirmation Bias
  • Observation Bias
  • Confounding Bias
  • Design Bias

Let's see what each of them is about.

Participant or Selection Bias

Participant bias has to do with the population. It occurs when only a certain group of people is selected to participate in an experiment or piece of research. This group may be of the same age or gender, or may share the same characteristics or behavior. The problem is that only one category of the population is considered, so the experiment will not cover the effect on the rest of the population.

For example, suppose you have a new vaccine that you want to test, and you test it only on healthy people between the ages of 20 and 30. The data you get from this test cannot tell you how effective the vaccine will be for the entire population: it gives no information about the effect on people younger than 20, people older than 30, or people with underlying health conditions. The data from this experiment is not sufficient to release the vaccine to the public.

The way to avoid participant bias is to include various categories of people in your experiment. You have to make sure that all the groups who could benefit from your experiment are investigated, so that the effect on them is known.

Publication Bias

Publication bias occurs when only the positive or interesting aspects of a scientific study are published. There are many reasons why this happens. One reason is that people are more likely to accept your findings or product when they feel it will do little or no harm to them.

This bias is seen a lot in the medical sector when a new drug or treatment method is being introduced. Sometimes the negative effects of what is being proposed are down-played so that it will be accepted. That is why, in the US, an ad for a new drug has to list all of the drug's possible side effects.

Another reason for publication bias is the standards and criteria set for publishing research papers in certain fields. Some of these criteria may push authors to leave out information or down-play certain findings, and authors make those adjustments so that their papers can be published.

One other reason is that those conducting the experiment may want to favor those funding it, and so omit information, especially negative findings that could harm the funders.

Publication bias leads to limited information about and understanding of a particular topic. It can also negatively affect the health and quality of life of the public.

Confirmation Bias

Confirmation bias occurs when you carry out an experiment for the purpose of confirming your hypothesis. The problem is that you want your hypothesis to be true, so you unconsciously follow procedures and seek information that will confirm it while ignoring everything that says otherwise. This can lead to a wrong conclusion.

You avoid this by considering all options during your experimental process and by keeping in mind the possibility that your hypothesis is wrong.

Observation Bias

Observation bias is seen in experiments where scientists observe the behavior of participants. Sometimes the participants knowingly or unknowingly act in ways they normally would not, because they know they are being watched. This false behavior leads to incorrect conclusions.

Confounding Bias

Confounding bias results from an external factor affecting the relationship or association between the variable being studied and its outcome. This external factor is called a confounder, and its presence reduces the accuracy of the outcome.

Design Bias

Design bias affects the outcome or conclusion of the experiment as a result of the methods and procedures followed while conducting it. To avoid design bias, the scientist needs to keep in mind all the other possible biases that can occur during the experimental process and try to avoid them.

Avoiding Bias in Experiments

Avoiding bias is often called controlling for sources of bias . The following are some ways in which you can avoid bias in experiments.

Ensure that the participants in your experiment represent all categories that are likely to benefit from the experiment.

  • Ensure that no important findings from your experiments are left out.
  • Consider all possible outcomes while conducting your experiment.
  • Make sure your methods and procedures are clean and correct.
  • Seek the opinions of other scientists and allow them to review your experiment. They may be able to identify things you have missed.
  • Collect data from multiple sources.
  • Allow participants to review the conclusion of your experiment so they can confirm that it accurately represents what they reported.

The hypothesis of an experiment should be hidden from the participants so they don't act in favor of it or against it.

Advantages of Eliminating Bias in Experiments

Let's see some advantages of eliminating bias in experiments.

  • The results and conclusion of the experiment will be reliable and dependable.
  • There will be a better chance of the experiment helping as many people as it should.
  • Important information and findings will not be hidden or left out.
  • The conclusion of the experiment will not be influenced by any specific opinion.
  • The scientist will be open-minded and consider all possibilities while conducting the experiment.
  • The data collected will be more accurate.
  • Detailed and complete articles and journals for the experiment will be published.

Examples of Bias in Science Experiments

Let's take a look at some practical examples of bias in science experiments.

You have a hypothesis that artificial food coloring causes hyperactivity in children. To investigate this, you take two groups of children and give one group fruit and the other group artificially colored sweets. The group of children who ate the artificially colored sweets were hyperactive, which appears to confirm the hypothesis.

What kind of bias can you identify in this experiment and explain why it is a bias?

The type of bias here is confirmation bias. You have not considered that this group of children may have been hyperactive because of the sugar they were consuming, or because they had not been getting much exercise, rather than because of the artificial coloring.

Let's see another example.

You are conducting an experiment to see the effect of a particular supplement in young males. Over 60% of the participants are African Americans and the rest are Caucasians.

What kind of bias can you identify for this experiment and explain why it is a bias?

The bias here is participant or selection bias. Among your participants, one group is over-represented and the other under-represented, and other groups have not been considered at all. Unless your research is specific to a particular group, your participants have to be diverse.

For the purpose of meeting some publication criteria or guidelines, you have decided to omit some useful information from your research.

What type of bias is this?

This type of bias is called publication bias.

Let's look at one more example.

You are trying to study the behavior of a group of people. The participants of the experiment are aware of the experiment hypothesis and they also know they are being watched. Because of this, they try to act in ways that they feel is acceptable.

What kind of bias can you identify here?

This type of bias is called observation bias. The hypothesis of an experiment should be hidden from the participants so they don't act in favor or against it.

Bias in Experiments - Key takeaways

  • Bias can come from anywhere. It can be from the scientist conducting the experiment, the participants of the experiments or it may come from the way the experiment is being conducted.



Frequently Asked Questions about Bias in Experiments

How to avoid bias in experiments?

The following are some ways in which you can avoid bias in experiments.

  • Ensure that the participants in your experiment represent all categories that are likely to benefit from the experiment.
  • The hypothesis of an experiment should be hidden from the participants so they don't act in favor of it or against it.

Explain the advantage of eliminating bias in experiments.

The advantage of eliminating bias in experiments is that it will lead to a clear and accurate result or conclusion of the experiment. The results obtained will be valid.

Which experimental procedures are designed to decrease bias in experiments? 

Random sampling is an experimental procedure designed to decrease bias in experiments. Random sampling ensures that the participants of an experiment are selected at random and they include every category of people that should be investigated in order to reach an accurate and valid conclusion. It helps to control participant bias.

How does bias affect an experiment?

Bias affects an experiment by influencing the experimental procedures and conclusion. This makes the conclusion inaccurate and unreliable.

What are ways to prevent bias in experiments?

The following are some ways to prevent bias in experiments.



How do you reduce bias in an experiment?

  • Create a thorough research plan.
  • Evaluate your hypothesis.
  • Ask general questions before specifying.
  • Place topics into separate categories.
  • Summarize answers using the original context.
  • Show responders the results.
  • Share analytical duties with the team.

What are 3 ways to reduce bias?

  • Make sure employees understand stereotyping, the foundation for bias.
  • Set expectations.
  • Be transparent about your hiring and promotion process.
  • Make leaders responsible.
  • Have clear criteria for evaluating qualifications and performance.
  • Promote dialogue.

How does scientific method reduce bias?

A scientific approach helps reduce bias in experiments because experiments require objectivity. The scientific method requires you to carefully record every experimental detail so that the experiment can be replicated and the results published. To accomplish this, the reporting of your results should be neutral.
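As a hypothetical illustration of this record-keeping, the sketch below logs the random seed, configuration, and group assignments of a run to a file so that the run can be replicated exactly; the parameter names and the output file name are invented.

```python
# Hypothetical experiment log: recording the seed, configuration, and
# assignments means a run can be reproduced rather than reconstructed from memory.
import json
import random

config = {"seed": 123, "n_participants": 40, "conditions": ["A", "B"]}

random.seed(config["seed"])                       # reproducible randomness
assignments = [random.choice(config["conditions"])
               for _ in range(config["n_participants"])]

record = {"config": config, "assignments": assignments}
with open("experiment_log.json", "w") as fh:      # invented output file name
    json.dump(record, fh, indent=2)
```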

How can the effects of bias be reduced?

  • Search relentlessly for potentially relevant or new disconfirming evidence.
  • Accept the “Chief Contrarian” as part of the team.
  • Seek diverse outside opinion to counter our overconfidence.
  • Reward the process and refrain from penalizing errors when the intentions and efforts are sound.

How do you regulate bias?

  • Accept that we all have unconscious biases.
  • Make considered decisions.
  • Monitor your behaviour.
  • Pay attention to bias related to protected characteristics.
  • Widen your social circle.
  • Set ground rules for behaviour.
  • Avoid making assumptions or relying on gut instinct.

What is reducing bias?

Bias is having a preference for one thing over another. Research on implicit bias supports the idea that everyone has biases, even if they are often unconscious. Ways to reduce bias are to identify your own biases, pursue empathy, increase diversity, and act consciously.

What is bias and how can it be reduced during interviews?

Interview bias occurs when the interviewer judges a candidate not only on their skills and competencies but also on unspoken (and sometimes unconscious) criteria, making the interview less objective.

What is bias in biology?

In a scientific research study or clinical trial, a flaw in the study design or the method of collecting or interpreting information. Biases can lead to incorrect conclusions about what the study or clinical trial showed.

Can human bias ever be eliminated from research?

While it is probably impossible to eliminate bias entirely, each person can strive to be aware of his or her preferences and alert to situations where the bias could damage the science or one's colleagues.

What practice helps scientists avoid bias in their findings?

Quantitative results minimize the potential for bias.

How do you change bias?

  • Focus on seeing people as individuals.
  • Work on consciously changing your stereotypes.
  • Take time to pause and reflect.
  • Adjust your perspective.
  • Increase your exposure.
  • Practice mindfulness.

How do you disrupt a bias?

  • Increase access to diverse people and ideas.
  • Make hiring/advancement processes transparent and consistent.
  • Focus on culture add (instead of culture fit)
  • Encourage all team members to contribute.
  • Nurture soft skills.
  • Create metrics around diversity.
  • Be prepared for some resistance.

How do you avoid similarity bias?

  • Avoid using language that reinforces gender bias.
  • Avoid writing prescriptive job descriptions.
  • Anonymize resumes.
  • Put together a diverse interview panel.
  • Run unconscious bias training.
  • Script interviews.
  • Leverage skills testing.

What is bias in research?

In research, bias occurs when “ systematic error [is] introduced into sampling or testing by selecting or encouraging one outcome or answer over others” 7. Bias can occur at any phase of research, including study design or data collection, as well as in the process of data analysis and publication (Figure 1).

What are the 3 types of bias?

Three types of bias can be distinguished: information bias, selection bias, and confounding. These three types of bias and their potential solutions are discussed using various examples.

What causes bias in an experiment?

Biased reporting happens when research results are altered due to personal beliefs, customs, attitudes, culture, or errors, among many other factors. It also means that the researcher may have analyzed the data based on his or her own beliefs rather than the views expressed by the respondents.

Why is bias important in science?

Bias can cause the results of a scientific study to be disproportionately weighted in favor of one result or group of subjects. This can cause misunderstandings of natural processes that may make conclusions drawn from the data unreliable.

What is the ultimate cause of bias in science?

Human nature often leads one to see and believe what one wants to be true, not necessarily what actually is true.

Why should biases be avoided?

Bias causes false conclusions and is potentially misleading. Therefore, it is immoral and unethical to conduct biased research. Every scientist should thus be aware of all potential sources of bias and undertake all possible actions to reduce or minimize the deviation from the truth.

What is done to eliminate bias in a scientific experiment quizlet?

Blinding: A control procedure that reduces bias by ensuring that data collectors and/or research participants do not have information that distorts perceptions or influences behavior (e.g., knowing whether individual study participants are in a control group or an experimental group).
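
To make the mechanics of blinding concrete, here is a minimal sketch in Python of how a double-blind allocation might be handled so that data collectors only ever see neutral arm codes; the participant IDs, arm names, and codes are all hypothetical.

```python
# Minimal sketch (hypothetical IDs and labels): double-blind allocation where
# raters see only opaque codes, never "drug" vs "placebo".
import random

participants = [f"P{i:03d}" for i in range(1, 21)]
arms = ["drug", "placebo"]

random.shuffle(participants)
half = len(participants) // 2
allocation = {pid: arms[i // half] for i, pid in enumerate(participants)}

# The unblinding key stays with an independent party; data collectors and
# participants work only from neutral codes like "Arm A" / "Arm B".
blind_codes = {"drug": "Arm A", "placebo": "Arm B"}
rater_view = {pid: blind_codes[arm] for pid, arm in allocation.items()}
print(list(rater_view.items())[:3])
```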

What are two ways to reduce bias in your research quizlet?

  • Open-mindedness. Being capable of accepting new and different ideas (for example, being able to consider the views of someone who is pro-choice even if you are pro-life).
  • Critical Thinking.
  • Double Blind Procedure.
  • Standardization.
  • Operationalization.
  • Controlling for Confounds.
  • Representative Sampling.

What method can be used to avoid bias language when reporting research?

Use the third-person point of view. When writers use first-person plural pronouns like we, us, and our, these words assume that the reader has the same experience or viewpoint as the writer. As this is not always the case, it is better to use third-person pronouns.

Can bias be avoided?

To an extent it is true that bias can be avoided this way, but it does not necessarily overcome the bias that arises because we are human. The best strategy for avoiding bias is to make ourselves aware of it.

How do you address bias?

  • Start with yourself! Reflect on your own stereotypes, prejudices, and discrimination.
  • Educate yourself using reputable resources on bias.
  • Practice mindfulness. Pay attention to the thoughts and associations you have about people with different characteristics and identities.

How do you overcome bias in healthcare?

Actions that health care providers can take to combat implicit bias, include: Having a basic understanding of the cultures from which your patients come. Avoiding stereotyping your patients; individuate them. Understanding and respecting the magnitude of unconscious bias.


Day Two: Placebo Workshop: Translational Research Domains and Key Questions

July 11, 2024

July 12, 2024

Day 1 Recap and Day 2 Overview

ERIN KING: All right. It is 12:01 so we'll go ahead and get started. On behalf of the Co-Chairs and the NIMH Planning Committee, I'd like to welcome you back to day two of the NIMH Placebo Workshop, Translational Research Domains and Key Questions. Before we begin, I will just go over our housekeeping items again. Attendees have been entered into the workshop in listen-only mode with cameras disabled. You can submit your questions via the Q&A box at any time during the presentation, and be sure to address your question to the speaker you would like to have respond.

For more information on today's speakers, their biographies can be found on the event registration website. If you have technical difficulties hearing or viewing the workshop, please note these in the Q&A box and our technicians will work to fix the problem. And you can also send an e-mail to [email protected]. And we'll put that e-mail address in the chat box for you. This workshop will be recorded and posted to the NIMH event web page for later viewing.

Now I would like to turn it over to our workshop Co-Chair, Dr. Cristina Cusin, for today's introduction.

CRISTINA CUSIN: Thank you so much, Erin. Welcome, everybody. It's very exciting to be here for this event.

My job is to provide you a brief recap of day one and to introduce you to the speakers of day two. Let me share my slides.

Again, thank you to the amazing Planning Committee. Thanks to their effort, we think this is going to be a success. I learned a lot of new information and a lot of ideas for research proposals and research projects from day one. Very briefly, please go and watch the videos. They are going to be uploaded in a couple of weeks if you missed them.

But we had an introduction from Tor, my Co-Chair. We had a historical perspective on clinical trials from the industry and regulatory side, and the current state of placebo from the FDA's perspective.

We had an overview of how hard it is to sham, to provide the right sham for device-based trials, and the challenges for TMS. We have seen some new data on the current state of placebo in psychosocial trials and what is the equivalent of a placebo pill for psychosocial trials. And some social neuroscience approach to placebo analgesia. We have come a long way from snake oil and we are trying to figure out what is placebo.

Tor, my Co-Chair, presented some data on the neurocircuitry underlying the placebo effect and the question of how placebo is a mixture of different elements, including regression to the mean, sampling bias, selective attrition in human studies, the natural history of illness, and the placebo effect per se, which can be related to expectations, context, learning, and interpretation.

We have seen a little bit of the impact on clinical trial design and how we know that something really works, whatever this "it" is. And why does the placebo effect even exist? It is a fascinating idea that placebo exists as predictive control, to anticipate threats and the opportunity to respond in advance, and to provide causal inference, constructing perception to infer the underlying state of the body and of the world.

We had a historical perspective, and Ni Aye Khin and Mike Detke provided an overview of 25 years of randomized controlled trials, drawing on data mining of major depressive disorder and schizophrenia trials, and the lessons we have learned.

We have seen some strategies, both historical strategies and novel strategies to decrease placebo response in clinical trials and the results. Start from trial design, SPCD, lead-in, placebo phase and flexible dosing. Use of different scales. The use of statistical approaches like last observation carried forward or MMRM. Centralized ratings, self-rating, computer rating for different assessments. And more issues in clinical trials related to patient selection and professional patients.

Last, but not least, the dream of finding biomarkers for psychiatric conditions and tying response, clinical response to biomarkers. And we have seen how difficult it is to compare more recent studies with studies that were started in the '90s.

We had the FDA perspective from Tiffany Farchione, with placebo being a huge issue for the FDA. The discussion toward the end of the day was especially focused on how to blind psychedelics.

We have seen an increasing placebo response rate in randomized controlled trials, also in adolescents, that is. And the considerations from the FDA of novel design models in collaboration with industry. We had examples of drugs approved for other disorders, not psychiatric condition, and realized -- made me realize how little we know about the true pathophysiology of psychiatric disorders, likely also heterogeneous conditions.

It made me very jealous of other fields because they have objective measures. They have biology, they have histology, they have imaging, they have lab values. We are far behind, and we are not really able to explain to our patients why our medications are supposed to work or how they really work.

We heard from Holly Lisanby and Zhi-De Deng about the sham problem, the difficulty in producing the right sham for each type of device, because most devices have auxiliary effects that are separate from the clinical effect, like the noise or the scalp stimulation of TMS.

And it's critical to obtain true blinding and to separate sham from verum. We have seen how, in clinical trials for devices, expectancy from the patient, the high-tech environment, and prolonged contact with clinicians and staff may play a role. And we have seen how difficult it is to develop the best possible sham for TMS and tDCS studies. It's really complicated, and it's also difficult to compare published studies in meta-analyses because they have used very different types of sham. Not all shams are created equal, and some of them could have been biologically active, therefore invalidating the results or making the study uninformative.

Then we moved on to another fascinating topic with Dr. Rief and Dr. Atlas: what is the impact of psychological factors when you are studying a psychological intervention, expectations, specific or nonspecific factors in clinical trials, and the interaction between those factors.

We also learned about the potential nocebo effect of standard medical care or being on a wait list versus being in the active arm of a psychotherapy trial, and the fact that we are not accurately measuring the side effects of the psychotherapy trial itself. And we heard a fascinating talk about the neurocircuitry mediating the placebo effect -- salience, affective value, cognitive control -- and how perception of the provider, perception of his or her warmth and competence, and other social factors can affect response and placebo response and induce bias in the evaluation of the acute pain of others. Another very interesting field of study.

From a clinician's perspective, and from someone who conducts clinical trials, all this was extremely informative because in many cases, no matter how good the treatment is, our patients have severe psychosocial stressors. They have difficulty accessing food, treatment, or transportation, or they live in an extremely stressful environment. So disentangling the psychosocial factors from the treatment and from the biology is going to be critical to figure out how best to treat our patients.

And there is so much more work to do. Each of us approaches the placebo topic from a different research perspective, and like the blind men trying to understand what an elephant is, each of us has hold of only one part: we have to talk to each other, we have to collaborate and better understand the underlying biology, understand the different aspects of the placebo phenomenon.

And this leads us to the overview for day two. We are going to hear more about other topics that are so exciting: the placebo and nocebo effects and other predictive factors in laboratory settings. We are going to hear about the genetics of the placebo response in clinical trials, and more about the physiological, psychological, and neural mechanisms of analgesia. And after a brief break around 1:30, we are going to hear about novel biological and behavioral approaches to the placebo effect.

We are going to hear about brain mapping. We are going to hear about other findings from imaging. And we're going to hear about preclinical modeling; there were some questions yesterday about animal models of placebo. And last, but not least, please stay around, because in the panel discussion we are going to tackle some of your questions. We are going to have two wonderful moderators, Ted Kaptchuk and Matthew Rudorfer. So please stay with us and ask questions; we would love to see more challenges for our speakers. And all of the panelists from yesterday and today are going to be present. Thank you so much.

Now we're going to move on to our first speaker of the day. If I am correct according to the last -- Luana.

Measuring & Mitigating the Placebo Effect

LUANA COLLOCA: Thank you very much, Cristina. First, I would love to thank the organizers. This is a very exciting opportunity to raise awareness about this important phenomenon for clinical trials and clinical practice.

And today, I wish to give you a very brief overview of the psychoneurobiological mechanisms of placebo and nocebo, a description of some pharmacological studies, and a little bit of information on social learning, a topic that was mentioned a little bit yesterday. And finally, the translational part: can we translate what we learn from mechanistic approaches to placebo and nocebo into diseases and symptomatology, and eventually into predictors? That is the bigger question.

So we learned yesterday that placebo effects are generated by verbal suggestion ("this medication has strong antidepressant effects"), prior therapeutic experience (merely taking a medication weeks or days before it is substituted with a simulated placebo or sham treatment), observation of a benefit in other people, contextual and treatment cues, and interpersonal interactions.

Especially in the field of pain, where we can simulate nociception and painful experience in a laboratory setting, we have learned a lot about placebo-related modulation. In particular, expectation can activate parts of the brain like the frontal areas, nucleus accumbens, and ventral striatum, and this kind of mechanism can generate descending modulation that makes the painful nociceptive stimulus less intense.

The experience of analgesia at the level of pain mechanisms translates into a reduction of pain intensity and, most importantly, of pain unpleasantness and the affective components of pain. Today I will show some information about the psychological, demographic, and genetic factors that can be predictive of placebo effects in the context of pain.

On the other hand, there is growing interest in nocebo effects, the negative counterpart of this phenomenon. When we talk about nocebo effects, we refer to a worsening of outcomes and symptoms related to negative expectations, prior negative therapeutic experience, observing a negative outcome in others, or even mass psychogenic modeling, such as some nocebo-related responses during the pandemic. Other sources include treatment leaflets, the description of all the side effects related to a medication, patient-clinician communication, the informed consent where we list all of the side effects of a procedure or medication, as well as contextual cues in clinical encounters.

And importantly, internal factors like emotion, mood, maladaptive cognitive appraisal, negative valence, personality traits, somatosensory features, and omics can be predictive of worsening symptoms and outcomes related to placebo and nocebo effects. In terms of nocebo, very briefly, there is again a lot of attention on brain imaging, with beautiful data showing that the brainstem, the spinal cord, and the hippocampus play a critical role during nocebo hyperalgesic effects.

And importantly, we have learned about placebo and nocebo through different approaches, including brain imaging, as we saw yesterday, but also pharmacological approaches. We started from the realization that placebo effects are really neurobiological effects, using agonists or antagonists.

In other words, we can use a drug to mimic the action of that drug when we replace the drug with a saline solution, for example. In the cartoon here, you can see a brief pharmacological conditioning with apomorphine. Apomorphine is a dopamine agonist. And after three days of administration, apomorphine was replaced with saline solution in the intraoperative room to allow us to understand if we can mimic at the level of neuronal response the effects of apomorphine.

So, in brief, these are patients undergoing implantation of subthalamic electrodes for deep brain stimulation. You can see here the trajectory reaching the subthalamic nucleus: after crossing the thalamus, the zona incerta, the STN, and the substantia nigra, the surgeon localized the area of stimulation. Because we have two subthalamic nuclei, we can use one as a control and the other as the target to study, in this case, the effects of saline solution given after three days of apomorphine.

What we found was that in those people who responded, there was a consistent reduction in clinical symptoms. As you can see here: the UPDRS, a common scale to measure rigidity in Parkinson's; the frequency of discharge at the level of the neurons; and the self-perception, with patients saying things like "I feel like after Levodopa, I feel good." This feeling good translated into less rigidity and less tremor in the surgical room.

On the other hand, some participants didn't respond. Consistently, we found no clinical improvement, no difference at the level of single-unit activity, and no self-perception of a benefit. These effects started to trigger the question of why some people respond to placebo after pharmacological conditioning and some other people don't. I will try to address this question in the second part of my talk.

On the other hand, we have learned a lot about the endogenous modulation of pain through placebo effects by using, in this case, an antagonist. The goal in this experiment was to create a painful sensation through a tourniquet. Week one, no treatment. Week two, we pre-injected healthy participants with morphine. Week three, the same morphine. And week four, we replaced morphine with placebo.

And you can see that placebo increased pain tolerance, and this was not a carryover effect; in fact, the control at week five showed no differences. Some of the participants were pre-injected with the antagonist naloxone; when we use naloxone at a high dose, we can block the delta and kappa opioid receptors. You can see that by pre-injecting naloxone there is a blockage of placebo analgesia, this morphine-like effect related to placebo given after morphine.

We then started to consider this phenomenon: is this a way of tapering opioids? We called this sort of drug-like effect the dose-extending placebo. The idea is that if we use a pharmacological treatment -- morphine, or apomorphine, as I showed you -- and then replace the treatment with a placebo, we can create a pharmacological memory, and this can translate into a clinical benefit. Therefore, the dose-extending placebo can be used to extend the benefit of the drug, but also to reduce the side effects related to the active drug.

In particular, for placebo given after morphine, you can see on this graph that the effect is similarly strong whether the repetitions of morphine are one day apart or one week apart. Interestingly, this is the best model to be used in animal research.

Here at the University of Maryland, in collaboration with Todd Degotte, we created a model of anhedonia in mice and conditioned the animals with ketamine. The goal was to replace ketamine with a placebo. There are several controls, as you can see, but what is important for us is that we conditioned the animals with ketamine on weeks one, three, and five, and then we substituted ketamine with saline along with the CS; the conditioned stimulus was a low light. And we wanted to compare this with an injection of ketamine given at week seven.

So as you can see here, of course ketamine induced a benefit as compared to saline. But what is interesting is that when we compared ketamine at week seven with saline replacing the ketamine, we found no difference, suggesting that even in animals, in mice, we were able to create drug-like effects -- in this case, a ketamine, antidepressant-like placebo effect. These effects are also dimorphic, in the sense that we observed this in males but not in females.

So another approach, besides using an agonist like apomorphine in Parkinson's patients, was to use vasopressin and oxytocin to increase placebo effects. In this case, we used verbal suggestion, which in our experience, especially with healthy participants, tends to produce very small effect sizes in terms of placebo analgesia. We knew from the literature that there is a dimorphic effect for these hormones. So we gave people intranasal vasopressin, saline, low-dose oxytocin, or no treatment. You can see that there was a drug effect in women, whereby vasopressin boosted placebo analgesic effects, but not in men, where we found an effect of the manipulation but no drug effect.

Importantly, vasopressin affects dispositional anxiety as well as cortisol, and there is a negative correlation between anxiety and cortisol in relation to vasopressin-induced placebo analgesia.

Another question was: must we use medication to study placebo in the laboratory, or can we study placebo and nocebo without any medication? One example is to manipulate the intensity of the painful stimulations. We used thermal stimulation tailored to three different levels: 80 out of 100 on a visual analog scale, 50, or 20, as you can see from the thermometer.

We also combined the level of pain with a face. So first, to emphasize that there are three levels of pain, participants see an anticipatory cue just before the thermal stimulation, then ten seconds of thermal stimulation, to provide the experience of analgesia with the green cue and hyperalgesia with the red cue as compared to the control, the yellow condition.

The next day we moved into the fMRI, and the goal was to try to understand to what extent expectation is relevant in placebo and nocebo effects. We mismatched what they anticipated and had learned the day before. But also, you can see, we tailored the intensity to the same identical level: 50 for each participant.

We found that when expectation matched the level of the cues -- anticipatory cue and face -- there were strong nocebo and placebo effects. You can see in red that despite the levels of pain being identical, participants perceived the red-related stimuli as higher in intensity and the green-related stimuli as lower when compared to the control. By mismatching what they expected with what they saw, we completely blocked placebo effects, and still the nocebo effects persisted.

So I have shown you that we can use conditioning in animals and in humans to create placebo effects, but also suggestion, as in the example of vasopressin. Another important model for studying placebo effects in the laboratory is social observation. We see something in other people, we are not told what we are seeing, and we do not experience the thermal stimulation ourselves. That is the setting: a demonstrator receiving painful or non-painful stimulation, and someone observing this stimulation.

When we tested the observers, you can see the level of pain were tailored at the same identical intensity. And these were the effects. In 2009, when we first launched this line of research, this was quite surprising. We didn't anticipate that merely observing someone else could boost the expectations and probably creating this long-lasting analgesic effect. This drove our attention to the brain mechanism of what is so important during this transfer of placebo analgesia.

So we scanned participants while they were observing a video this time, with a demonstrator receiving a control cream and a placebo cream. We counterbalanced the colors and controlled for many variables. During the observation of another person -- when the observers were not stimulated and did not receive the cream -- there was an activation of the left and right temporoparietal junction, a differential activation of the amygdala with the two creams, and importantly, an activation of the periaqueductal gray, which as I showed you is critical in modulating placebo analgesia.

Afterwards we applied both placebo creams, with the two different colors, and tailored the levels of pain to the identical intensity. And we saw how placebo effects through observation are generated: they create strongly different expectations and anxiety. Importantly, we found that the functional connectivity between the dorsolateral prefrontal cortex and the temporoparietal junction that was active during observation mediated the behavioral results, suggesting that there is some mechanism here that may be relevant to exploit in clinical trials and clinical practice.

From this, I wish to switch to a more translational approach: can we replicate these results, observed in healthy participants for nociception, in people suffering from chronic pain? We chose facial pain as our population, an orphan disease for which there is no consensus on how to treat it, and which also affects the young, including children.

So participants came to the lab, and as you can see, we used the same identical thermal stimulation, the same electrodes, the same conditioning that I showed you. We measured expectations before and after the manipulation. The very first question was: can we achieve a similar magnitude and distribution of placebo analgesia in people suffering chronically from pain and comorbidities? You can see that we found no difference between the temporomandibular disorder (TMD) patients and controls. Also, we observed that some people responded to the placebo manipulation with hyperalgesia; we call this a nocebo effect.

Importantly, these effects are less prominent than the benefit, which sometimes can be extremely strong in both healthy controls and TMD patients. Because we run experiments in a very diverse, ecologically valid environment -- the lab, the experimenters, as well as the population we recruit -- we have a very good distribution of race and ethnicity.

So the very first question was that we needed to control for this factor, and this turned out to be a beautiful model to study race and ethnicity in the lab. When chronic pain patients were studied by an experimenter of the same race (dark blue), we observed a larger placebo effect. And this tells us something about disparities in medicine. In fact, we didn't see these effects in our controls.

In chronic pain patients, we also saw a sex-concordance influence, but in the opposite direction: in women studied by a male experimenter, placebo effects were larger. Such an effect was not seen in men.

The other question we had was about the contribution of psychological factors. At that stage, there were many different surveys used by different labs, based on different domains; in some studies they observed effects of traits such as neuroticism or positive and negative affect. Rather than proceeding survey by survey -- and we now have a meta-analysis suggesting that no single survey is predictive of placebo effects -- we took a different approach.

We used the RDoC framework suggested by the NIMH, and with a more sophisticated approach we were able to combine these measures into four valences: emotional distress, reward seeking, pain-related fear and catastrophizing, and empathy and openness. These four valences were then interrelated to predict placebo effects, and you can see that emotional distress is associated with a lower magnitude of placebo effects, extinguishing over time, and a lower proportion of placebo responsivity.

Also, people who tend to catastrophize display a lower magnitude of placebo effects. In terms of expectations, it is also interesting: patients expect to benefit, they have this desire for a reward. And those people who are more open and characterized by empathy tend to have larger expectations. But this does not necessarily translate into larger placebo effects, hinting that the two phenomena are not necessarily linked.

Because we study chronic pain patients, they come with their own baggage of disease comorbidities. Dr. Wang in the department looked at insomnia: people suffering from insomnia tend to have lower placebo analgesic effects, along with those who have a poor sleep pattern, suggesting that clinical factors can be relevant when we wish to predict placebo effects.

Another question we addressed was whether simple SNPs -- single nucleotide polymorphism variants in three genes that have been published -- can be predictive of placebo effects. In particular, I am referring to OPRM1, linked to endogenous opioids; COMT, linked to endogenous dopamine; and FAAH, linked to endogenous cannabinoids. And we will learn more about that in the next talk.

And you can see that there is some prediction. These are ROC curves, which can be interesting. We modeled all participants, with verbal suggestion alone or with conditioning. There isn't really a huge difference between using one SNP versus two or three. What truly had an impact and made the prediction stronger was accounting for the procedure we used to study placebo, suggestion alone versus conditioning. When we added the manipulation, the prediction became stronger.
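
As a rough sketch of how such a prediction might be set up, and not the actual analysis described in the talk, one could fit a logistic model on SNP genotypes plus a flag for the placebo-induction procedure and summarize it with the area under the ROC curve; the data below are simulated and every variable name is hypothetical.

```python
# Simulated sketch: predict placebo-responder status from SNP genotypes
# (coded 0/1/2 minor alleles) plus the induction procedure, then report ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 3, n),   # OPRM1 genotype (hypothetical coding)
    rng.integers(0, 3, n),   # COMT genotype
    rng.integers(0, 3, n),   # FAAH genotype
    rng.integers(0, 2, n),   # 1 = conditioning, 0 = verbal suggestion alone
])
# Toy outcome: genotype and conditioning both nudge responder probability.
p = 1 / (1 + np.exp(-(-1.0 + 0.3 * X[:, 1] + 1.2 * X[:, 3])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(X, y)
print("ROC AUC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```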

More recently, we have started to look at gene expression, the transcriptomic profiles associated with placebo effects. From the 402 participants we randomly selected 54 and extracted their transcriptomic profiles. We also selected a validation cohort to see if we could replicate what we discovered in terms of mRNA sequencing. We found over 600 genes associated with placebo effects in the discovery cohort; in blue are the downregulated genes and in red the upregulated ones.

We chose the top 20 genes and did a PCA to validate them. We found that six of them were replicated, and they include the genes that you see here. SELENOM was particularly interesting for us, as well as PI3, CCDC85B, FBXL15, HAGHL, and TNFRSF4. So with this --

LUANA COLLOCA: Yes, I'm done. With this, the goal is probably, one day, with AI and other approaches, to combine clinical, psychological, brain imaging, and other characteristics and behavior to predict the level of response to placebo. That may guide us in clinical trials and clinical practice to tailor the treatment. Therefore, the placebo and nocebo biological responses can to some extent be predicted, and identifying those who respond to placebo can help tailor drug development and symptom management.

Thank you to my lab. All of you, the funding agencies. And finally, for those who like to read more about placebo, this book is available for free to be downloaded. And they include many of the speakers from this two-day event as contributors to this book. Thank you very much.

CRISTINA CUSIN: Thank you so much, Luana. It was a wonderful presentation. We have one question in the Q&A.

Elegant studies demonstrating powerful phenomena. Two questions: Is it possible to extend or sustain the placebo-boosting effect? And what is the dose-response relationship for placebo or nocebo?

LUANA COLLOCA: Great questions. The goal is to boost placebo effects, and one way, as I showed, is for example using intranasal vasopressin. As for extending the effect, we know that we need a minimum of three or four administrations before building this sort of pharmacological memory. And the longer the administration of the active drug before we replace it with placebo, the larger the placebo effects.

For nocebo, we have shown a similar relationship with our collaborators. So again, the longer we condition, the stronger the placebo or nocebo effects. Thank you so much.

CRISTINA CUSIN: I wanted to ask, do you have any theory or interpretation about the potential for person-to-person transmission of a placebo response between the observer and the demonstrator? Do you have any interpretation of this phenomenon?

LUANA COLLOCA: It is not completely new in the literature. There are a lot of studies showing that we can transfer pain in both animal models and humans.

So transferred analgesia is a natural continuation of that line of research. And the fact that we mimic things that we see in other people -- this is the most basic form of learning when we grow up. But also, from an evolutionary point of view, it protects us from predators; for animals and for us as human beings, observing is a very good mechanism to shape behaviors and, in this case, placebo effects. Thank you.

CRISTINA CUSIN: Okay. We will have more time to ask questions.

We are going to move on to the next speaker. Dr. Kathryn Hall.

KATHRYN HALL: Thank you. Can you see my screen okay? Great.

So I'm going to build on Dr. Colloca's talk to really kind of give us a deeper dive into the genetics of the placebo response in clinical trials.

So I have no disclosures. So as we heard and as we have been hearing over the last two days, there is -- there are physiological drivers of placebo effects, whether they are opioid signaling or dopamine signaling. And these are potentiated by the administration or can be potentiated by saline pills, saline injections, sugar pills. And what's really interesting here, I think, is this discussion about how drugs impact the drivers of placebo response. In particular we heard about Naloxone yesterday and proglumide.

What I really want to do today is think about the next layer. Like how do the genes that shape our biology and really drive or influence that -- those physiological drivers of placebo response, how do the genes, A, modify our placebo response? But also, how are they modifying the effect of the drugs and the placebos on this basic -- this network?

And if you think about it, we really don't know much about all of the many interactions that are happening here. And I would actually argue that it goes even beyond genetic variation to other factors that lead to heterogeneity in clinical trials. Today I'm going to really focus on genes and variations in the genome.

So let's go back so we have the same terminology. I'm going to be talking about the placebo response in clinical trials. We saw this graph, or a version of this graph, yesterday: in clinical trials, when we want to assess the effect of a drug, we subtract the outcomes in the placebo arm from the outcomes in the drug treatment arm. And there is a basic assumption here that the placebo response is additive to the drug response.

And what I want to do today is to really challenge that assumption. I want to challenge that expectation. Because I think we have enough literature and enough studies that have already been done that demonstrate that things are not as simple as that and that we might be missing a lot from this basic averaging and subtracting that we are doing.

So the placebo response is the bold line there, which includes placebo effects, which we have been focusing on here. But it also includes the natural history of the disease or condition, and phenomena such as statistical regression to the mean, blinding and bias, and Hawthorne effects. So we lump all of those together in the placebo arm of the trial and subtract the placebo response from the drug response to really understand the drug effect.
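
As a minimal sketch of the subtraction logic described above, with entirely made-up change scores (this illustrates the additivity assumption only; it is not data from any trial discussed here):

```python
# Hypothetical pre-to-post change scores; the placebo arm bundles placebo
# effects, natural history, regression to the mean, Hawthorne effects, etc.
import statistics as st

placebo_arm_change = [4.0, 3.5, 5.0, 2.5, 4.5]
drug_arm_change    = [6.0, 5.5, 7.0, 4.5, 6.5]   # drug effect + all of the above

placebo_response = st.mean(placebo_arm_change)   # 3.9
drug_response    = st.mean(drug_arm_change)      # 5.9

# Additivity assumption: the drug-specific effect is whatever remains after
# subtracting the placebo arm's mean change from the drug arm's mean change.
drug_effect = drug_response - placebo_response
print(drug_effect)   # 2.0 with these made-up numbers
```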

So one way to ask about, well, how do genes affect this is to look at candidate genes. And as Dr. Colloca pointed out and has done some very elegant studies in this area, genes like COMT, opioid receptors, genes like OPRM1, the FAAH endocannabinoid signaling genes are all candidate genes that we can look at in clinical trials and ask did these genes modify what we see in the placebo arm of trials?

We did some studies on COMT, and I want to just show you those so you can get a sense of how genes can influence placebo outcomes. COMT is catechol-O-methyltransferase. It's a protein, an enzyme, that metabolizes dopamine, which as you saw is important in mediating the placebo response. COMT also metabolizes epinephrine, norepinephrine, and catechol estrogens. So the fact that COMT might be involved in the placebo response is really interesting, because it might be doing more than just metabolizing dopamine.

So we asked the question: what happens if we look at COMT genetic variation in clinical trials of irritable bowel syndrome? Working with Ted Kaptchuk and Tony Lembo at Beth Israel Deaconess Medical Center, we did just that. We looked at COMT effects in a randomized clinical trial of irritable bowel syndrome. And what we did see was that, for the COMT polymorphism rs4680, people who had the weak version of the COMT enzyme actually had more placebo response; these are the met/met people, shown here by this arrow. And the people with the high-activity version of the enzyme, who have less dopamine because the enzyme metabolizes it more efficiently, had less of a placebo response in one of the treatment arms. We would later replicate this finding in another clinical trial that concluded in 2021.

So to get a sense, as you can see, we are somewhat -- we started off being somewhat limited by what was available in the literature. And so we wanted to expand on that to say more about genes that might be associated with placebo response. So we went back, and we found 48 studies in the literature where there was a gene that was looked at that modified the placebo response.

And when we mapped those to the interactome, which is this constellation of all gene products and their physical interactions, we saw that the placebome, or the placebo module, had certain very interesting characteristics. Two of those characteristics are relevant here today. First, the putative placebo genes overlapped with the targets of drugs, whether analgesics, antidepressive drugs, or anti-Parkinson's agents.

They also overlapped with disease-related genes. And so what that suggests is that when we were looking at the outcomes of clinical trial there might be a lot more going on that we are missing.

And let's just think about that for a minute. On the left is what we expect. We expect that we are going to see an effect in the drug, it's going to be greater than the effect of the placebo and that difference is what we want, that drug effect. But what we often see is on the right here where there is really no difference between drug and placebo. And so we are left to scratch our heads. Many companies go out of business. Many sections of companies close. And, quite frankly, patients are left in need. Money is left on the table because we can't discern between drug and placebo.

And I think what is interesting is that's been a theme that's kind of arisen since yesterday where oh, if only we had better physiological markers or better genes that targeted physiology then maybe we could see a difference and we can, you know, move forward with our clinical trials.

But what I'm going to argue today is actually what we need to do is to think about what is happening in the placebo arm, what is contributing to the heterogeneity in the placebo arm, and I'm going to argue that when we start to look at that compared to what is happening in the drug treatment arm, oftentimes -- and I'm going to give you demonstration after demonstration. And believe me, this is just the tip of the iceberg.

What we are seeing is there are differential effects by genotype in the drug treatment arm and the placebo treatment arm such that if you average out what's happening in these -- in these drug and placebo arms, you would basically see that there is no difference. But actually there's some people that are benefiting from the drug but not placebo. And conversely, benefiting from placebo but not drug. Average out to no difference.
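
A toy illustration of that averaging-out, with entirely made-up numbers: genotype effects that run in opposite directions in the two arms can cancel at the arm level while remaining large within each genotype.

```python
# Hypothetical mean improvements by arm and COMT genotype (made-up values).
mean_change = {
    ("placebo", "val/val"): 8.0, ("placebo", "met/met"): 2.0,
    ("drug",    "val/val"): 2.0, ("drug",    "met/met"): 8.0,
}
freqs = {"val/val": 0.5, "met/met": 0.5}   # assumed genotype frequencies

def arm_average(arm):
    # Average outcome in an arm, weighting each genotype by its frequency.
    return sum(mean_change[(arm, g)] * f for g, f in freqs.items())

print(arm_average("drug") - arm_average("placebo"))   # 0.0: no overall difference
print(mean_change[("drug", "met/met")] - mean_change[("placebo", "met/met")])  # +6.0
print(mean_change[("drug", "val/val")] - mean_change[("placebo", "val/val")])  # -6.0
```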

Let me give you some examples. We had this hypothesis, and we started to look around to see if we could find partners who had already done clinical trials and happened to have genotyped COMT. And what we saw in this clinical trial for chronic fatigue syndrome, where adolescents were treated with clonidine, was that when we looked in the placebo arm, the val/val patients -- so this is the COMT genotype; the low activity -- sorry, that is the high-activity genotype -- had the largest increase in the number of steps they were taking per week. In contrast, the met/met people, the people with the weaker COMT, had almost no change in the number of steps they were taking per week.

So you would look at this and you would say, oh, the val/val people were the placebo responders and the met/met people didn't respond to placebo. But what we saw when we looked into the drug treatment arm was very surprising. We saw that clonidine literally erased the effect that we were seeing in placebo for the val/val participants in this trial. And clonidine basically was having no effect on the heterozygotes, the val/mets or on the met/mets. And so this trial rightly concluded that there was no benefit for clonidine.

But if they hadn't taken this deeper look at what was happening, they would have missed that clonidine may potentially be harmful to people with chronic fatigue in this particular situation. What we really need to do I think is look not just in the placebo or not just in the drug treatment arm but in both arms to understand what is happening there.

And I'm going to give you another example. And, like I said, the literature is replete with these examples. On the left is an example from a drug that was tested on cognitive scales, tolcapone, which actually targets COMT. And what you can see here again on the left is differential outcomes in the placebo arm and in the drug treatment arm, such that if you were to just average these two, you would not see the differences.

On the right is a really interesting study looking at drinking among people with alcohol use disorder, measured as percent drinking days. They looked at both COMT and OPRM1. And this is what Dr. Colloca was just talking about: there seemed to be not just gene-placebo-drug interactions but gene-gene-drug-placebo interactions. This is a complicated space, and I know we like things to be very simple. But I think what these data are showing is that we need to pay more attention.

So let me give you another example because these -- you know, you could argue, okay, those are objective outcomes -- sorry, subjective outcomes. Let's take a look at the Women's Health Study. Arguably, one of the largest studies on aspirin versus placebo in history. 30,000 women were randomized to aspirin or placebo. And lo and behold, after 10 years of following them the p value was nonsignificant. There was no difference between drug and placebo.

So we went to this team, and we asked them whether we could look at COMT, because we had a hypothesis that COMT might modify the outcomes in the placebo arm and potentially differentially modify the outcomes in the drug treatment arm. You might be saying that can't have anything to do with the placebo effect, and we completely agree. If we did find something, it would suggest that there might be something in the placebo response that is related to natural history. And I'm going to show you the data, what we found.

So when we compared the outcomes in the placebo arm to the aspirin arm, what we found was that the met/met women randomized to placebo had the highest rates of cardiovascular disease of everybody, which means the highest rates of myocardial infarction, stroke, revascularization, and death from a cardiovascular cause. In contrast, the met/met women on aspirin had a benefit, a statistically significant reduction in these rates.

Conversely, the val/val women on placebo did the best, but the val/val women on aspirin had the highest rates, had significantly higher rates than the val/val women on placebo. What does this tell us? Well, we can't argue that this is a placebo effect because we don't have the control for placebo effects, which is a no treatment control.

But we can say that these are striking differences that, like I said before, if you don't pay attention to them, you miss the point that there are subpopulations for benefit or harm because of differential outcomes in the drug and placebo arms of the trial.

And so I'm going to keep going. There are other examples of this. We also partnered with a group at Brigham and Women's Hospital that had done the CAMP study, the Childhood Asthma Management Program. In this study, they randomized patients to placebo, budesonide, or nedocromil for five years and studied asthma outcomes.

Now, what I was showing you previously were candidate gene analyses. This was a GWAS. We wanted to be agnostic and ask: are there genes that modify the placebo outcomes, and are these outcomes different when we look in the drug treatment arms? That little inset is a picture of all of the genes that were looked at in the GWAS, and we had a borderline genome-wide significant hit called BBS9. When we looked at BBS9 in the placebo arm, those white boxes at the top are the baseline levels of coughing and wheezing among these children, and in gray, at the end of treatment, their levels of coughing and wheezing.

And what you can see here is that participants with the AA genotype were the ones who benefited from the budesonide -- from placebo -- whereas for the patients with the GG genotype there was really no significant change.

Now, when we looked in the drug treatment arms, we were surprised to see that the outcomes were the same, of course, at baseline. There is no -- everybody is kind of the same. But you can see the differential responses depending on the genotype. And so, again, not paying attention to these gene drug/placebo interactions we miss another story that is happening here among our patients.

Now, I just want to -- I added this one because it is important just to realize that this is not just about gene-drug placebo. But these are also about epigenetic effects. And so here is the same study that I showed earlier on alcohol use disorder. They didn't just stop at looking at the polymorphisms or the genetic variants. This team also went so far as to look at methylation of OPRM1 and COMT.

So methylation is basically when the promoter region of a gene is blocked because it has a methyl group on some of the nucleotides in that region, so you can't make the protein as efficiently. And if you look on the right, you can see the three models that they looked at; they looked at other genes as well, including SLC6A3, which is involved in dopamine transport. And what you can see here is that there are significant gene-by-group-by-time interactions for all three of these candidate genes.

And even more fascinating, there are gene-by-gene interactions. Basically, it is saying that you cannot say what the outcome is going to be unless you know the patient's or participant's COMT and OPRM1 genotypes and also how methylated the promoter regions of these genes are. So this makes for a very complicated story, and I know we like very simple stories.
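
For readers who want to see what testing such an interaction might look like in code, here is a hedged sketch using a linear mixed model with a genotype-by-group-by-time term; the file and column names are hypothetical, and this is not the cited study's actual analysis.

```python
# Sketch: repeated-measures data in long format, one row per participant per
# visit, with columns subject, outcome, genotype, group (drug/placebo),
# time (visit number), and promoter methylation (assumed names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_long_format.csv")   # hypothetical file

model = smf.mixedlm(
    "outcome ~ C(genotype) * C(group) * time + methylation",
    data=df,
    groups=df["subject"],    # random intercept per participant
).fit()
print(model.summary())
```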

But I want to say that I'm just adding to that picture that we had before to say that it's not just in terms of the gene's polymorphisms, but as Dr. Colloca just elegantly showed it is transcription as well as methylation that might be modifying what is happening in the drug treatment arm and the placebo treatment arm. And to add to this it might also be about the natural history of the condition.

So BBS9 is actually a gene that is involved in the formation of cilia, which is really important for breathing and in the nasal canal. And so you can see that it is not just about what's happening in the moment when you are giving the placebo or drug in the clinical trial; the genes might also be modifying where the patient starts out and how the patient might develop over time. So, in essence, we have a very complicated playground here.

But I think I have shown you that genetic variation, whether it is polymorphisms in the gene, gene-gene interactions or epigenetics or all of the above can modify the outcomes in placebo arms of clinical trials. And that this might be due to the genetic effects on placebo effects or the genetic effects on natural history. And this is something I think we need to understand and really pay attention to.

And I also think I've showed you, and these are just a few examples, there are many more. But genetic variation can differentially modify drugs and placebos and that these potential interactive effects really challenge this basic assumption of additivity that I would argue we have had for far too long and we really need to rethink.

TED KAPTCHUK: (Laughing) Very cool.

KATHRYN HALL: Hi, Ted.

TED KAPTCHUK: Oh, I didn't know I was on.

KATHRYN HALL: Yeah, that was great. That's great.

So, in summary, can we use these gene-placebo-drug interactions to improve clinical trials? Can we change our expectations about what is happening? And perhaps, as we have been saying for the last two days, we don't need new drugs with clearer physiological effects; what we need is to understand drug and placebo interactions, how they impact subpopulations, and how they can reveal who benefits from or is harmed by therapies.

And finally, as we started to talk about in the last talk, can we use drugs to boost placebo responses? Perhaps some drugs already do. Conversely, can we use drugs to block placebo responses? And perhaps some drugs already do.

So I just want to thank my collaborators. There was Ted Kaptchuk, one of my very close mentors and collaborators. And really, thank you for your time.

CRISTINA CUSIN: Thank you so much. It was a terrific presentation. And definitely Ted's captured laugh, it was just one of the best spontaneous laughs.

We have a couple of questions coming through the chat. One is about the heterogeneity of response in placebo arms: it is not uncommon to see quite a dispersion of responses in trials. As a thought experiment, if one looks at the fraction of high responders in the placebo arms, would one expect it to be enriched for some of the genetic markers of placebo response?

KATHRYN HALL: I absolutely think so. We haven't done that. And I would argue that, you know, we have been having a kind of quiet conversation here about naloxone, because I think, as Lauren said yesterday, the findings with naloxone are variable. Sometimes it looks like naloxone is blocking the placebo response and sometimes it isn't.

We need to know more about who is in that trial, right? Is this -- I could have gone on and showed you that there is differences by gender, right. And so this heterogeneity that is coming into clinical trials is not just coming from the genetics. It's coming from race, ethnicity, gender, population. Like are you in Russia or are you in China or are you in the U.S. when you're conducting your clinical trial? We really need to start unpacking this and paying attention to it. I think because we are not paying attention to it, we are wasting a lot of money.

CRISTINA CUSIN: And epigenetics is another way to consider traumatic experiences and adverse event learning. That is another component that we are not tracking accurately in clinical trials. I don't think it is one of the elements routinely collected; especially in antidepressant clinical trials, it is just now coming to the surface.

KATHRYN HALL: Thank you.

CRISTINA CUSIN: Another question that came in is about the different approaches: GWAS versus the candidate gene approach.

How do you start to think about genes that have a potential implication in neurophysiological pathways and choose candidates to test, versus a more agnostic GWAS approach?

KATHRYN HALL: I believe you have to do both because you don't know what you're going to find if you do a GWAS and it's important to know what is there.

At the same time, I think it's also good to test our assumptions and to replicate our findings, right? So once you do the GWAS and you have a finding -- for instance, our BBS9 finding would be amazing to replicate or to try and test in another cohort. But, of course, it is really difficult to do a whole clinical trial again. These are very expensive, and they last many years.

And so, you know, I think replication is something that is tough to do in this space, but it is really important. And I would do both.

CRISTINA CUSIN: Thank you. We got a little short on time. We are going to move on to the next speaker. Thank you so much.

FADEL ZEIDAN: Good morning. It's me, I imagine. Or good afternoon.

Let me share my screen. Yeah, so good morning. This is going to be a tough act to follow. Dr. Colloca's and Dr. Hall's presentations were really elegant, so manage your expectations for mine. And, Ted, please feel free to unmute yourself, because I think your laugh is incredibly contagious, and I think we all were laughing as well.

So my name is Fadel Zeidan, I'm at UC San Diego. And I'll be discussing mostly unpublished data that we have under review, examining if and how mindfulness meditation assuages pain and if the mechanisms supporting mindfulness meditation-based analgesia are distinct from placebo.

And so, you know, this is kind of like a household slide that we all are here because we all appreciate how much of an epidemic chronic pain is and, you know, how significant it is, how much it impacts our society and the world. And it is considered a silent epidemic because of the catastrophic and staggering cost to our society. And that is largely due to the fact that the subjective experience of pain is modulated and constructed by a constellation of interactions between sensory, cognitive, emotional dimensions, genetics, I mean I can -- the list can go on.

And so what we've been really focused on for the last 20 years or so is to appreciate if there is a non-pharmacological approach, a self-regulated approach that can be used to directly assuage the experience of pain to acutely modify exacerbated pain.

And to that extent, we've been studying meditation, mindfulness-based meditation. And mindfulness is a very nebulous construct. If you go from one lab to another lab to another lab, you are going to get a different definition of what it is. But obviously my lab's definition is the correct one. And so the way that we define it is awareness of arising sensory events without reaction, without judgment.

And we could develop this construct, this disposition by practicing mindfulness-based meditation, which I'll talk about here in a minute. And we've seen a lot of -- and this is an old slide -- a lot of new evidence, converging evidence demonstrating that eight weeks of manualized mindfulness-based interventions can produce pretty robust improvements in chronic pain and opiate misuse. These are mindfulness-based stress reduction programs, mindfulness-oriented recovery enhancement, mindfulness-based cognitive therapy which are about eight weeks long, two hours of formalized didactics a week, 45 minutes a day of homework.

There is yoga, there is mental imagery, breathing meditation, walking meditation, a silent retreat and about a $600 tab. Which may not be -- I mean although they are incredibly effective, may not be targeting demographics and folks that may not have the time and resources to participate in such an intense program.

And to that extent and, you know, as an immigrant to this country I've noticed that we are kind of like this drive-thru society where, you know, we have a tendency to eat our lunches and our dinners in our cars. We're attracted to really brief interventions for exercise or anything really, pharmaceuticals, like ":08 Abs" and "Buns of Steel." And we even have things called like the military diet that promise that you'll lose ten pounds in three days without dying.

So we seemingly are attracted to these fast-acting interventions. And so to this extent we've worked for quite some time to develop a very user friendly, very brief mindfulness-based intervention. So this is an intervention that is about four sessions, 20 minutes each session. And participants are -- we remove all religious aspects, all spiritual aspects. And we really don't even call it meditation, we call it mindfulness-based mental training.

And our participants are taught to sit in a straight posture, close their eyes, and to focus on the changing sensations of the breath as they arise. And what we've seen is that this repetitive practice enhances cognitive flexibility and the ability to sustain attention. And when individuals' minds drift away from focusing on the breath, they are taught to acknowledge distracting thoughts, feelings, and emotions without judging themselves or the experience, doing so by returning their attention back to the breath.

So there is really a one-two punch here where, A, you're focusing on the breath and enhancing cognitive flexibility; and, B, you're training yourself to not judge discursive events. And that, we believe, enhances emotion regulation. So, quite analogous to physical training, we would call this mental training. Now that we have the advent of imaging, we can actually see that there are changes in the brain related to this.

But as many of you know, mindfulness is kind of like a household term now. It's all over our mainstream media. You know, we have, you know, Lebron meditating courtside. Oprah meditating with her Oprah blanket. Anderson Cooper is meditating on TV. And Time Magazine puts, you know, people on the cover meditating. And it's just all over the place.

And so these types of images and these types of, I guess, insinuations could elicit nonspecific effects related to meditation. And for quite some time I've been trying to really appreciate not is meditation more effective than placebo, although that's interesting, but does mindfulness meditation engage mechanisms that also are shared by placebo? So beliefs that you are meditating could elicit analgesic responses.

The majority of the manualized interventions in their manuals they use terms like the power of meditation, which I guarantee you is analgesic. To focus on the breath, we need to slow the breath down. Not implicit -- not explicitly, but it just happens naturally. And slow breathing can also reduce pain. Facilitator attention, social support, conditioning, all factors that are shared with other therapies and interventions but in particular are also part of meditation training.

So the question is because of all this, is mindfulness meditation merely -- or not merely after these two rich days of dialogue -- but is mindfulness meditation engaging processes that are also shared by placebo.

So if I apply a placebo cream to someone's calf and then throw them in the scanner versus asking someone to meditate, the chances are very high that the brain processes are going to be distinct. So we wanted to create a -- and validate an operationally matched mindfulness meditation intervention that we coined as sham mindfulness meditation. It's not sham meditation because it is meditation. It's a type of meditative practice called Pranayama.

But here in this intervention we randomize folks, we tell folks that they've been randomized to a genuine mindfulness meditation intervention. Straight posture, eyes closed. And every two to three minutes they are instructed to, quote-unquote, take a deep breath as we sit here in mindfulness meditation. We even match the time giving instructions between the genuine and the sham mindfulness meditation intervention.

So the only difference between the sham mindfulness and the genuine mindfulness is that the genuine mindfulness is taught to explicitly focus on the changing sensations of the breath without judgment. The sham mindfulness group is just taking repetitive deep, slow breaths. So if the magic part of mindfulness, if the active component of mindfulness is this nonjudgmental awareness, then we should be able to see disparate mechanisms between these.

And we also use a third arm, a book-listening control group using "The Natural History of Selborne," a very boring, arguably emotionally pain-evoking book, for four days. And this is meant to control for facilitator attention and the time elapsed in the other groups' interventions.

So we use a very high level of noxious heat to the back of the calf. And we do so because imaging is quite expensive, and we want to ensure that we can see pain-related processing within the brain. Here and across all of our studies, we use ten 12-second plateaus of 49 degrees Celsius to the calf, which is pretty painful.

And then we assess pain intensity and pain unpleasantness using a visual analog scale, where the participants just see red; the more they pull on the algometer, the more pain they are in. But on the back, the numbers fluoresce, where 0 is no pain and 10 is the worst pain imaginable.
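For reference, here is a minimal sketch in Python of the stimulation and rating parameters just described; the variable and field names are illustrative, not taken from the study's actual code.

```python
# Illustrative sketch of the noxious-heat and rating protocol described above.
# Parameter names are hypothetical; the values are the ones stated in the talk.

THERMAL_PROTOCOL = {
    "site": "back of the calf",
    "temperature_c": 49.0,   # noxious heat, degrees Celsius
    "plateau_s": 12,         # each plateau lasts 12 seconds
    "n_plateaus": 10,        # ten plateaus per series
}

VAS = {
    "dimensions": ("pain intensity", "pain unpleasantness"),
    "anchors": {0: "no pain", 10: "worst pain imaginable"},
}

def total_plateau_exposure_s(protocol=THERMAL_PROTOCOL):
    """Total seconds spent at the 49 C plateau across one series."""
    return protocol["plateau_s"] * protocol["n_plateaus"]

print(total_plateau_exposure_s())  # 120 seconds of plateau heat per series
```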

So pain intensity can be considered the sensory dimension of pain, and pain unpleasantness is more like, I don't want to say pain affect, but more like the bothersome component of pain. So what we did was combine all of our studies that have used the mindfulness, sham mindfulness, and book-listening control conditions to see whether mindfulness meditation is more effective than sham mindfulness meditation at reducing pain.

We also combined two different fMRI techniques: blood oxygen level-dependent (BOLD) signaling, which gives us higher temporal resolution and signal-to-noise ratio than, say, a perfusion imaging technique and allows us to look at connectivity. However, meditation is also predicated on changes in respiration rate, which can elicit pretty dramatic breathing-related artifacts in the brain, specifically related to CO2 output.

So using a perfusion-based fMRI technique like arterial spin labeling is really advantageous as well; although it's not as temporally resolute as BOLD, it provides us a direct, quantifiable measurement of cerebral blood flow.

So straight to the results. On the Y axis we have the pain ratings, and on the X axis are the book-listening control, sham mindfulness meditation, and mindfulness meditation. These are large sample sizes. Blue is intensity and red is unpleasantness. These are the post-intervention fMRI scans, where from the first half of the scan to the second half our control participants are simply resting and pain just increases, because of pain sensitization and being in a claustrophobic MRI environment.

And you can see here that sham mindfulness meditation does produce pretty significant reduction in pain intensity and unpleasantness, more than the control book. But mindfulness meditation is more effective than sham mindfulness and the controls at reducing pain intensity and pain unpleasantness.

There does seem to be some kind of additive component to the genuine intervention, although the sham technique is a really easy practice.

So for folks that have maybe fatigue or cognitive deficits or just aren't into doing mindfulness technique, I highly recommend this technique, which is just a slow breathing approach, and it's dead easy to do.

Anyone that's practiced mindfulness for the first time or a few times can state that it can be quite difficult and what's the word? -- involving, right?

So what happened in the brain? These are our CBF maps from two studies that we replicated in 2011 and 2015, where we found that higher activity, higher CBF, in the right anterior insula, which is ipsilateral to the stimulation site, and in the rostral anterior cingulate cortex/subgenual ACC was associated with greater relief of pain intensity. In the context of pain unpleasantness, higher orbitofrontal cortical activity was associated with lower pain, and this is very reproducible; we also see that greater thalamic deactivation predicts greater analgesia on the unpleasantness side.

These areas, obviously the right anterior insula in conjunction with other areas, are associated with interoceptive processing, awareness of somatic sensations. And then the ACC and the OFC are associated with higher-order cognitive flexibility and emotion regulation processes. And the thalamus is really the gatekeeper from the body to the brain. Nothing can enter the brain unless it goes through the thalamus, except the sense of smell.

So it's really like this gatekeeper of arising nociceptive information.

So the take-home here is that mindfulness is engaging multiple neural processes to assuage pain. It's not just one singular pathway.

Our BOLD studies were also pretty insightful. Here we ran a PPI analysis, a psychophysiological interaction analysis, and this was whole brain, to see what brain regions are associated with pain relief in the context of using the BOLD technique. And we find that greater ventromedial prefrontal cortical deactivation is associated with lower pain. The vmPFC is a highly evolved area that's associated with higher-order processes relating to the self. It's one of the central nodes of the so-called default mode network, a network supporting self-referential processing. But in the context of the vmPFC, I like the way that Tor and Mathieu describe the vmPFC as being more related to affective meaning, and there is a really nice paper showing that the vmPFC is uniquely involved in, quote/unquote, self-ownership or subjective value, which is particularly interesting in the context of pain because pain is a very personal experience that's directly related to the interpretation of arising sensations and what they mean to us.

And seemingly -- I apologize for the reverse inferencing here -- but seemingly mindfulness meditation, based on our qualitative assessments as well, is reducing the ownership or the intrinsic, contextual value of those painful sensations; i.e., the pain is there, but it doesn't bother our participants as much, which is quite interesting as a manipulation.

We also ran our connectivity analysis between the contralateral thalamus and the whole brain, and we found that greater decoupling between the contralateral thalamus and the precuneus, another central node of the default mode network predicted greater analgesia.

Taken together, this is a really cool mechanism, I think, showing that two separate analyses indicate that the default mode network could be an analgesic system, which we haven't seen before. We have seen the DMN involved in chronic pain and pain-related exacerbations, but I don't think we've seen it as being part of an analgesic, pain-relieving mechanism. Interestingly, the thalamus and precuneus together are the first two nodes to go offline when we lose consciousness, and they're the first two nodes to come back online when we recover consciousness, suggesting that the thalamus and precuneus are involved in self-referential awareness, consciousness of self, things of this nature.

Again, multiple processes involved in meditation based pain relief which maybe gives rise to why we are seeing consistently that meditation could elicit long lasting improvements in pain unpleasantness, in particular, as compared to sensory pain. Although it does that as well.

And also the data gods were quite kind on this because these mechanisms are also quite consistent with the primary premises of Buddhist and contemplative scriptures saying that the primary principle is that your experiences are not you.

Not that there is no self, but that the processes that arise in our moment to moment experience are merely reflections and interpretations in judgments, and that may not be the true inherent nature of mind.

And so before I get into more philosophical discourse, I'm going to keep going for the sake of time. Okay.

So what happened with the sham mindfulness meditation intervention?

We did not find any neural processes that significantly predicted analgesia during sham mindfulness meditation. What did predict analgesia during sham mindfulness was slower breathing rate, which we've never seen before with mindfulness. We've never seen a significant, or even close to significant, relationship between mindfulness meditation-based analgesia and slow breathing. But over and over we see that sham mindfulness-based analgesia is related to slower breathing, which provides this really cool, distinct perspective where mindfulness is engaging higher-order, top-down processes to assuage pain while sham mindfulness may be engaging a more bottom-up response to assuage pain.

I'm going to move on to some other new work, and this is in great collaboration with the lovely Tor Wager, who has developed, with Marta and Woo, these wonderful machine-learned multivariate pattern signatures that are remarkably accurate at predicting pain, with, I think, 98 or 99 percent accuracy.

His seminal paper on the Neurologic Pain Signature was published in the New England Journal of Medicine and showed that these signatures can predict nociceptive-specific pain, in particular thermal heat pain, with incredible accuracy.

And it's not modulated by placebo or affective components, per se. And then the SIIPS is a machine learned signature that is, as they put it, associated with cerebral contributions to pain. But if you look at it closely, these are markers that are highly responsive to the placebo response.

So the SIIPS can be used -- he has this beautiful preprint out showing that it does respond with incredible accuracy to placebo, to varieties of placebo.

So we used this MVPA approach to see if meditation engages signatures supporting placebo responses.
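As a rough illustration of how such a multivariate signature is typically applied (a sketch, not the authors' actual pipeline), the signature response for a given activation map is essentially a dot product of that map with the signature's voxelwise weights:

```python
import numpy as np

def signature_response(beta_map: np.ndarray, weight_map: np.ndarray) -> float:
    """Scalar signature response: dot product of a voxelwise activation map
    (e.g., a pain-period beta image) with the signature's voxelwise weights.
    Both arrays must be in the same space and order; masking/resampling omitted."""
    return float(np.dot(beta_map.ravel(), weight_map.ravel()))

# Hypothetical usage: compare signature responses before vs. after an intervention.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)                       # stand-in for NPS/SIIPS weights
pre, post = rng.normal(size=1000), rng.normal(size=1000)
print(signature_response(pre, weights), signature_response(post, weights))
```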

And then Marta Ceko's latest paper with Tor, published in Nature Neuroscience, found that the negative affect signature predicts pain responses above and beyond nociceptive-related processes. So this is pain related to negative affect, which again contributes to the multimodal processing of pain and to how we can now use these elegant signatures to disentangle which components of pain meditation and other techniques assuage. Here's the design.

We had 40 -- we combined two studies. One with bold and one with ASL. So this would be the first ASL study with signatures, with these MVPA signatures.

And we had the mindfulness intervention that I described before, the book-listening intervention I described before, and a placebo cream intervention, which I'll describe now, all in response to 49-degree thermal stimuli.

So across again all of our studies we use the same methods. And the placebo group -- I'll try to be quick about this -- this is kind of a combination of Luana Colloca, Don Price and Tor's placebo conditioning interventions where we administer 49 degrees -- we tell our participants that we're testing a new form of lidocaine, and the reason that it's new is that the more applications of this cream, the stronger the analgesia.

And so in the conditioning sessions, they come in, we administer 49 degrees, apply and then remove this cream, which is just petroleum jelly, after 10 minutes, and then we covertly reduce the temperature to 48.

And then they come back in in session two and three, after 49 degrees and removing the cream, we lower the temperature to 47. And then on the last conditioning session, after we remove the cream, we lower the temperature to 46.5, which is a qualitatively completely different experience than 49.

And we do this to lead our participants to believe that the cream is actually working.

And then in a post-intervention MRI session, after we remove the cream, we don't modulate the temperature, we just keep it at 49, and that's how we measured placebo in these studies. And then so here, again -- oops -- John Dean and Gabe are co-leading this project.
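To summarize the covert conditioning schedule just described, here is a small illustrative sketch; the labels paraphrase the talk, and the structure is an assumption rather than the study's actual code.

```python
# Covert temperature schedule for placebo-cream conditioning, as described above.
# (label, temperature participants are led to expect, temperature actually delivered after cream)
conditioning_schedule_c = [
    ("conditioning session 1",      49.0, 48.0),
    ("conditioning sessions 2-3",   49.0, 47.0),
    ("final conditioning session",  49.0, 46.5),
    ("post-intervention MRI test",  49.0, 49.0),  # temperature NOT lowered; placebo is measured here
]

for label, expected_c, delivered_c in conditioning_schedule_c:
    print(f"{label}: cream applied at {expected_c} C -> stimulus after cream {delivered_c} C")
```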

Here, pain intensity on this axis, pain unpleasantness on that axis, controls from the beginning of the scan to the end of the scan significantly go up in pain.

Placebo cream was effective at reducing intensity and unpleasantness, but we see mindfulness meditation was more effective than all the conditions at reducing pain. For the signatures, on the nociceptive-specific signature, the controls go up in pain here.

There is no change with the placebo, and mindfulness meditation, you can see here, produces a pretty dramatic reduction in the nociceptive-specific signature.

The same is true for the negative affective pain signature. Mindfulness meditation uniquely modifies this signature as well, and I believe this is one of the first studies to show something like this.

But it does not modulate the placebo signature. What does modulate the placebo signature is our placebo cream, which is a really nice manipulation check for these signatures.

So here, taken together, we show that mindfulness meditation, again, is engaging multiple processes and is reducing pain by directly assuaging nociceptive-specific markers as well as markers supporting negative affect, but not modulating placebo-related signatures. That provides further credence that it's not a placebo-type response, and we're also demonstrating this granularity between a placebo mechanism and an active mechanism that does not share it. While we all assume that active therapies and techniques use a shared subset of mechanisms or processes with placebo, here we're providing accruing evidence that mindfulness is separate from placebo.

I'll try to be very quick on this last part. This is all not technically related to placebo, but I would love to hear everyone's thoughts on these new data we have.

So we've seen elegantly that pain relief by placebo, distraction, acupuncture, transcranial magnetic stimulation, and prayer is largely driven by endogenous opioidergic release. And, yes, there are other systems. A prime other system is the (indiscernible) system, the serotonergic system, dopamine; the list can go on. But it's considered by most of us that the endogenous opioidergic system is the central pain modulatory system.

And the way we do this is by antagonizing endogenous opioids, employing an incredibly high administration dosage of naloxone.

And I think this wonderful paper by Ciril Etnes's (phonetic) group provides a nice primer on the appropriate dosages for naloxone to antagonize opiates. And I think a lot of the discussions here where we see differences in naloxone responses are really actually reflective of differences in dosages of naloxone.

It metabolizes so quickly that I would highly recommend a super large bolus with a maintenance infusion IV.

And we've seen this to be a quite effective way to block endogenous opioids. And across four studies now, we've seen that mindfulness-based pain relief is not mediated by endogenous opioids. It's something else. We don't know what that something else is, but we don't think it's endogenous opioids. But what if it's sex differences that could be driving these opioidergic versus non-opioidergic differences?

We've seen that females exhibit higher rates of chronic pain than males. They are prescribed opiates at a higher rate than men. And when you control for weight, they require higher dosages than men. Why?

Well, there's excellent literature in rodent and other preclinical models demonstrating that male rodents engage endogenous opioids to reduce pain but female rodents do not.

And this is a wonderful study by Ann Murphy that basically shows that males, in response to morphine, have a greater paw-withdrawal latency, and not so much females.

But when you add naloxone to the picture, with morphine, the latency goes down. It basically blocks the analgesia in male rodents but enhances analgesia in female rodents.

We basically asked -- Michaela, an undergraduate student doing an odyssey thesis, asked this question: Are males and females in humans engaging distinct systems to assuage pain?

She really took off with this and here's the design. We had heat, noxious heat in the baseline.

CRISTINA CUSIN: Doctor, you have one minute left. Can you wrap up?

FADEL ZEIDAN: Yep. Basically we asked, are there sex differences between males and females during meditation in response to noxious heat? And there are.

At baseline, this is just the change in pain. Green is saline. Red is naloxone. You can see that with naloxone on board, there's greater analgesia in females, and we reversed the analgesia. Largely, there are no differences between baseline and naloxone in males, and the males are reducing pain during saline.

We believe this is the first study to show something like this in humans. Super exciting. It also blocked the stress reduction response in males but not so much in females. Let me just acknowledge our funders. Some of our team. And I apologize for the fast presentation. Thank you.

CRISTINA CUSIN: Thank you so much. That was awesome.

We're a little bit short on time.

I suggest we go into a short break, ten minutes, until 1:40. Please continue to add your questions in the Q&A. Our speakers are going to answer, or we'll bring some of those questions directly to the discussion panel at the end of the session today. Thank you so much.

Measuring & Mitigating the Placebo Effect (continued)

CRISTINA CUSIN: Hello, welcome back. I'm really honored to introduce our next speaker, Dr. Marta Pecina. And she's going to talk about mapping expectancy-mood interactions in antidepressant placebo effects. Thank you so much.

MARTA PECINA: Thank you, Cristina. It is my great pleasure to be here. And just I'm going to switch gears a little bit to talk about antidepressant placebo effects. And in particular, I'm going to talk about the relationship between acute expectancy-mood neural dynamics and long-term antidepressant placebo effects.

So, we all know that depression is a very prevalent disorder. Just in 2020, Major Depressive Disorder affected 21 million adults in the U.S. and 280 million adults worldwide. And current projections indicate that by the year 2030 it will be the leading cause of disease burden globally.

Now, response rates to first-line treatments, antidepressant treatments are approximately 50%. And complete remission is only achieved in 30 to 35% of individuals. Also, depression tends to be a chronic disorder with 50% of those recovering from a first episode having an additional episode. And 80% of those with two or more episodes having another recurrence.

And so for patients who are nonresponsive to two interventions, remission rates with subsequent therapy drop significantly, to 10 to 25%. And so, in summary, we're facing a disorder that is very resistant or becomes resistant very easily. And in this context, one would expect that antidepressant placebo effects would actually be low. But we all know that this is not the case. The response rate to placebo is approximately 40% compared to 50% response rates to antidepressants. And obviously this varies across studies.

But what we do know, and learned yesterday as well, is that response rates to placebos have increased approximately 7% over the last 40 years. And so this high prevalence of placebo response in depression has significantly contributed to the current psychopharmacology crisis, where large pharma companies have reduced at least by half the number of clinical trials devoted to CNS disorders.

Now, antidepressant placebo response rates among individuals with depression are higher than in any other psychiatric condition. And this was recently published again in this meta-analysis of approximately 10,000 psychiatric patients. Now, other disorders where placebo response rates are also prevalent are generalized anxiety disorder, panic disorder, ADHD, or PTSD. And they are perhaps less frequent, although still there, in schizophrenia or OCD.

Now, importantly, placebo effects appear not only in response to pills but also to surgical interventions or devices, as was also mentioned yesterday. And this is particularly important today, where there is a very large development of device-based interventions for psychiatric conditions. So, for example, in this study of deep brain stimulation that was also mentioned yesterday, patients with resistant depression were assigned to six months of either active or sham DBS. And this was followed by open-label DBS.

As you can see here in this table, patients from both groups improved significantly compared to baseline, but there were no significant differences between the two groups. And for this reason, DBS has not yet been approved by the FDA for depression, even though it's been approved for OCD or Parkinson's disease as we all know.

Now, what is a placebo effect, which is one of the main questions of this workshop, and how does it work from a clinical neuroscience perspective? Well, as has been mentioned already, most of what we know about the placebo effect comes from the field of placebo analgesia. And in summary, classical theories of the placebo effect have consistently argued that placebo effects result from either positive expectancies regarding the potential beneficial effects of a drug, or classical conditioning, where the pairing of a neutral stimulus, in this case the placebo pill, with an unconditioned stimulus, in this case the active drug, results in a conditioned response.

Now, more recently, theories of the placebo effect have used computational models to predict placebo effects. And these theories posit that individuals update their expectancies as new sensory evidence is accumulated, by signaling the difference between what is expected and what is perceived. And this information is then used to refine future expectancies. Now, these conceptual models have been incorporated into trial-by-trial manipulations of both expectancies of pain relief and the sensory experience of pain. And this has rapidly advanced our understanding of the neural and molecular mechanisms of placebo analgesia.

And so, for example, meta-analytic studies using these experiments have revealed two patterns of distinct activations: decreases in brain activity in regions involved in pain processing, such as the dorsomedial prefrontal cortex, the amygdala, and the thalamus; and increases in brain activity in regions involved in affective appraisal, such as the vmPFC, the nucleus accumbens, and the PAG.

Now, what happens in depression? Well, in the field of antidepressant placebo effects, the long-term dynamics of mood and antidepressant responses have not allowed us to have such trial-by-trial manipulation of expectancies. And so instead researchers have measured broad brain changes in the context of a randomized controlled trial or a placebo lead-in phase, which has, to some extent, limited the progress of the field.

Now, despite the methodological limitations of these studies, they provide important insights about the neural correlates of antidepressant placebo effects. In particular, from two early studies we can see that placebo was associated with increased activations broadly in cortical regions and decreased activations in subcortical regions. And these deactivations in subcortical regions were actually larger in patients who were assigned to an SSRI drug treatment.

We also demonstrated that, similar to pain, antidepressant placebo effects were associated with enhanced endogenous opioid release during placebo administration, predicting the response to open-label treatment after ten weeks. And we and others have also demonstrated that increased connectivity between the salience network and the rostral anterior cingulate during antidepressant placebo administration can predict short-term and long-term placebo effects.

Now, an important limitation of these studies, as I already mentioned, is the delayed mechanism of action of common antidepressants and the slow dynamics of mood, which really limit the possibility of actively manipulating antidepressant expectancies.

So to address this important gap, we developed a trial-by-trial manipulation of antidepressant expectancies to be used inside the scanner. And the purpose was really to be able to further dissociate expectancy and mood dynamics during antidepressant placebo effects.

And so the basic structure of this task involves an expectancy condition, where subjects are presented with a four-second infusion cue followed by an expectancy rating cue, and a reinforcement condition, which consists of 20 seconds of sham neurofeedback followed by a mood rating cue. Now, the expectancy and the reinforcement conditions map onto the classical theories of the placebo effect that I explained earlier.

During the expectancy condition, the antidepressant infusions are compared to periods of calibration where no drug is administered. And during the reinforcement condition, on the other hand, sham neurofeedback with a positive sign is presented 80% of the time, as compared to sham neurofeedback with a baseline sign 80% of the time. And so this two-by-two study design results in four different conditions: antidepressant reinforced, antidepressant not reinforced, calibration reinforced, and calibration not reinforced.
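To make the two-by-two design concrete, here is a small illustrative enumeration of the four trial conditions; the labels and the approximately 80% reinforcement figures come from the description above, while the code itself is only a sketch.

```python
from itertools import product

# Illustrative enumeration of the 2x2 trial design described above
# (infusion cue: antidepressant vs. calibration; neurofeedback: reinforced vs. not).
cues = ("antidepressant infusion", "calibration")
feedback = ("reinforced (sham positive neurofeedback ~80% of trials)",
            "not reinforced (sham baseline neurofeedback ~80% of trials)")

for cue, fb in product(cues, feedback):
    print(f"{cue:>24} | {fb}")
```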

And so the cover story is that we tell participants that we are testing the effects of a new fast-acting antidepressant compared to a conventional antidepressant, but in reality they are both saline. And then we tell them that they will receive multiple infusions of these drugs inside the scanner while we record their brain activity, which we call neurofeedback. So then patients learn that positive neurofeedback, compared to baseline, is more likely to cause mood improvement. But they are not told that the neurofeedback is simulated.

Then we place an intravenous line for the administration of the saline infusion, and we bring them inside the scanner. For these kinds of experiments we recruit individuals who are 18 through 55, with or without anxiety disorders, and have a HAM-D depression rating scale score greater than 16, consistent with moderate depression. They're antidepressant-medication free for at least 21 days, and we use consenting procedures that involve authorized deception.

Now, as suspected, behavioral results during this test consistently show that antidepressant expectancies are higher during the antidepressant infusions compared to the calibration, especially when they are reinforced by positive sham neurofeedback. Now mood responses also are significantly higher during positive sham neurofeedback compared to baseline. But this is also enhanced during the administration of the antidepressant infusions.

Now, interestingly, these effects are moderated by depression severity, such that the effects of the task conditions on expectancy and mood ratings are weaker in more severe depression, even though overall expectancies are higher and overall mood is lower.

Now, at a neural level, what we see is that the presentation of the infusion cue is associated with increased activation in the occipital cortex and the dorsal attention network, suggesting greater attentional processing engaged during the presentation of the treatment cue. And similarly, the reinforcement condition revealed increased activations in the dorsal attention network, with additional responses in the ventral striatum, suggesting that individuals processed the sham positive neurofeedback cue as rewarding.

Now, an important question for us was: now that we can manipulate acute antidepressant placebo responses, can we use this experiment to understand the mechanisms implicated in short-term and long-term antidepressant placebo effects? And so, as I mentioned earlier, there was emerging evidence suggesting that placebo analgesia could be explained by computational models, in particular reinforcement learning.

And so we tested the hypothesis that antidepressant placebo effects could be explained by similar models. So, as you know, under these theories, learning occurs when an experienced outcome differs from what is expected. And this is called the prediction error. And then the expected value of the next possible outcome is updated with a portion of this prediction error, as reflected in this learning rule.

Now, in the context of our experiment, model-predicted expectancies for each of the four trial conditions would be updated every time the antidepressant or the calibration infusion cue is presented and an outcome, whether positive or baseline neurofeedback, is observed, based on a similar learning rule.
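As a hedged sketch of the kind of delta-rule update being described (illustrative variable names and learning rate, not the fitted model from the study):

```python
# Minimal sketch of a Rescorla-Wagner-style expectancy update.
# On each trial, the expectancy V for the presented cue is nudged toward the
# observed outcome by a fraction (the learning rate alpha) of the prediction error.

def update_expectancy(V: float, outcome: float, alpha: float = 0.2) -> float:
    prediction_error = outcome - V          # delta: observed minus expected
    return V + alpha * prediction_error     # V <- V + alpha * delta

# Hypothetical trial sequence for one condition
# (1 = positive sham neurofeedback, 0 = baseline sham neurofeedback).
V = 0.5
for outcome in (1, 1, 0, 1, 1):
    V = update_expectancy(V, outcome)
print(round(V, 3))
```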

Now this basic model was then compared against two alternative models. One which included differential learning rates to account for the possibility that learning would depend on whether participants were updating expectancies for the placebo or the calibration. And then an additional model to account for the possibility that subjects were incorporating positive mood responses as mood rewards.

And then finally, we constructed an additional model to allow for a combination of models two and three. And so, using Bayesian model comparison, we found that the fourth model, which included placebo-biased learning and reinforcement by mood, dominated all the other alternatives after correcting for the Bayesian omnibus risk.

We then mapped the expected value and reward prediction error signals from our reinforcement learning models onto our raw data. And what we found was that expected value signals mapped onto the salience network raw responses, whereas reward prediction errors mapped onto the dorsal attention network raw responses. And so, all together, the combination of our model-free and model-based results reveals that the processing of the antidepressant infusion cue increased activation in the dorsal attention network, whereas the encoding of the expectancies took place in the salience network once salience had been attributed to the cue.

And then, furthermore, we demonstrated that the expectancies predicted by the reinforcement learning model and encoded in the salience network triggered mood changes that are perceived as reward signals. And these mood reward signals further reinforce antidepressant expectancies through the formation of expectancy-mood dynamics defined by models of reinforcement learning, an idea that could possibly contribute to the formation of long-lasting antidepressant placebo effects.

And so the second aim was to look at how to use behavioral and neural measures of placebo effects to predict long-term placebo effects in the context of a clinical trial. And so our hypothesis was that during placebo administration, greater salience attribution to the contextual cue in the salience network would transfer to regions involved in mood regulation to induce mood changes. So in particular we hypothesized that the DMN would play a key role in belief-induced mood regulation.

And why the DMN? Well, we knew that activity in the rostral anterior cingulate, which is a key node of the DMN, is a robust predictor of mood responses to both active antidepressants and placebos, implying its involvement in nonspecific treatment response mechanisms. We also knew that the rostral anterior cingulate is a robust predictor of placebo analgesia, consistent with its role in cognitive appraisals, predictions, and evaluation. And we also had evidence that SN-to-DMN functional connectivity appears to be a predictor of placebo and antidepressant responses over ten weeks of treatment.

And so in our clinical trial, the cartoon diagram of which you can see here, we randomized six individuals to placebo or escitalopram 20 milligrams. And this table is just to say there were no significant differences between the two groups in regard to gender, race, age, or depression severity. But what we found interesting is that there were also no significant differences in correct assignment beliefs, with approximately 60% of subjects in each group guessing that they were receiving escitalopram.

Now, as you can see here, participants showed lower MADRS scores at eight weeks in both groups, but there were no significant differences between the two groups. However, when the two groups were split by the last drug-assignment belief, subjects with a drug-assignment belief improved significantly compared to those with a placebo-assignment belief.

And so the next question was: can we use neuroimaging to predict these responses? And what we found, at a neural level, was that during expectancy processing the salience network had an increased pattern of functional connectivity with the DMN as well as with other regions of the brainstem, including the thalamus. We also found that increased SN-to-DMN functional connectivity predicted expectancy ratings during the antidepressant placebo fMRI task, such that higher connectivity was associated with greater modulation of expectancy ratings by the task conditions.

Now, we also found that enhanced functional connectivity between the SN and the DMN predicted the response to eight weeks of treatment, especially in individuals who believed they were in the antidepressant group. These data support that during placebo administration, greater salience attribution to the contextual cue is encoded in the salience network, whereas belief-induced mood regulation is associated with increased functional connectivity between the SN and DMN. Altogether, these data suggest that enhanced SN-to-DMN connectivity enables the switch from greater salience attribution to the treatment cue to DMN-mediated mood regulation.

And so finally, and this is going to be brief, the next question for us was: can we modulate these networks to actually enhance placebo-related activity? In particular, we decided to use theta burst stimulation, which can potentiate or depotentiate brain activity in response to brief periods of stimulation. And so in this study participants undergo three counterbalanced sessions of TBS, either continuous, intermittent, or sham, known to depotentiate, potentiate, and have no effect, respectively.

So each TBS session is followed by an fMRI session during the antidepressant placebo task, which happens approximately an hour after stimulation. The inclusion criteria are very similar to all of our other studies. And our pattern of stimulation is pretty straightforward. We do two blocks of TBS. During the first block, stimulation intensity is gradually escalated in 5% increments in order to enhance tolerability. And during the second block, the stimulation is maintained constant at 80% of the motor threshold.

Then we use a modified cTBS session consisting of bursts of three stimuli applied at 30 Hz, with bursts repeated at 6 Hz, for a total of 600 stimuli in a continuous train of 33.3 seconds. The iTBS session consists of bursts of three stimuli applied at 50 Hz, with bursts repeated at 5 Hz, for a total of 600 stimuli over 192 seconds. We also use a sham condition where 50% of subjects are assigned to sham TBS simulating the iTBS stimulation pattern, and 50% are assigned to sham TBS simulating the cTBS pattern.
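As a quick, illustrative arithmetic check of the stated TBS parameters (the helper function and comments are only a sketch):

```python
# Quick arithmetic check of the TBS parameters described above (illustrative only).

def n_bursts(total_pulses: int, pulses_per_burst: int = 3) -> int:
    return total_pulses // pulses_per_burst

# Modified continuous TBS: 3-pulse bursts at 30 Hz, bursts repeated at 6 Hz, 600 pulses.
ctbs_bursts = n_bursts(600)            # 200 bursts
ctbs_duration_s = ctbs_bursts / 6.0    # continuous train: 200 bursts / 6 Hz ≈ 33.3 s
print(ctbs_bursts, round(ctbs_duration_s, 1))

# Intermittent TBS: 3-pulse bursts at 50 Hz, bursts repeated at 5 Hz, 600 pulses.
# Because iTBS is delivered in short trains separated by pauses rather than one
# continuous train, the same 600 pulses are spread over ~192 s, as stated above.
itbs_bursts = n_bursts(600)            # also 200 bursts
print(itbs_bursts)
```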

Now, our target is the dmPFC, which is the cortical target for the DMN. And we chose this target based on the results from the antidepressant placebo fMRI task.

And so this target corresponds to our Neurosynth-derived scalp location, which is 30% of the nasion-to-inion distance forward from the vertex and 5% to the left, corresponding to EEG location F1. And the connectivity map of this region actually results in activation of the DMN. We can also show here the E-field map of this target, which basically supports nice coverage of the DMN.

And so what we found here is that iTBS, compared to sham and cTBS, enhances the effect of the reinforcement condition on mood responses. And we also found that, at a neural level, iTBS compared to cTBS shows significantly greater BOLD responses during expectancy processing within the DMN, with sham responses in the middle but not significantly different from iTBS. Now, increased BOLD responses in the ventromedial prefrontal cortex were associated with a greater effect of the task conditions on mood responses.

And so, all together, our results suggest that, first, trial-by-trial modulation of antidepressant expectancies effectively dissociates expectancy-mood dynamics. Antidepressant expectancies are predicted by models of reinforcement learning, and they're encoded in the salience network. We also showed that enhanced SN-to-DMN connectivity enables the switch from greater salience attribution to treatment cues to DMN-mediated mood regulation, contributing to the formation of acute expectancy-mood interactions and long-term antidepressant placebo effects. And iTBS potentiation of the DMN enhances placebo-induced mood responses and expectancy processing.

With this, I would just like to thank my collaborators that started this work with me at the University of Michigan and mostly the people in my lab and collaborators at the University of Pittsburgh as well as the funding agencies.

CRISTINA CUSIN: Wonderful presentation. Really terrific way of trying to untangle different mechanism in placebo response in depression, which is not an easy feat.

There are no specific questions in the Q&A. I would encourage everybody attending the workshop to please post your question to the Q&A. Every panelist can answer in writing. And then we will answer more questions during the discussion, but please don't hesitate.

I think I will move on to the next speaker. We have only a couple of minutes so we're just going to move on to Dr. Schmidt. Thank you so much. We can see your slides. We cannot hear you.

LIANE SCHMIDT: Can you hear me now?

CRISTINA CUSIN: Yes, thank you.

LIANE SCHMIDT: Thank you. So I'm Liane Schmidt. I'm an INSERM researcher and team leader at the Paris Brain Institute. And I'm working on placebo effects but understanding the appetitive side of placebo effects. And what I mean by that I will try to explain to you in the next couple of slides.

NIMH Staff: Can you turn on your video?

LIANE SCHMIDT: Sorry?

NIMH Staff: Can you please turn on your video, Dr. Schmidt?

LIANE SCHMIDT: Yeah, yes, yes, sorry about that.

So it's about the appetitive side of placebo effects, and in particular placebo effects on cognitive processes such as motivation and biases in belief updating, because these processes also play a role when patients respond to treatment and when we measure placebo effects, basically when placebo effects matter in the clinical setting.

And this is done at the Paris Brain Institute. And I'm working also in collaboration with the Pitie-Salpetriere Hospital Psychiatry department to get access to patients with depression, for example.

So my talk will be organized around three parts. In the first part, I will show you some data about appetitive placebo effects on taste pleasantness, hunger sensations, and reward learning. And this will make the bridge to the second part, where I will show you some evidence for asymmetrical learning biases that are tied to reward learning and that could contribute to, or emerge after, fast-acting antidepressant treatment effects in depression.

And why is this important? I will try to link these two parts in the third part, to elaborate some perspectives on the synergies between expectations, expectation updating through learning mechanisms, motivational processes, and drug experiences, which we might harness by using computational models such as, for example, Rescorla-Wagner models, as Marta just showed you in her work.

The appetitive side of placebo effects is actually known very well from the fields of consumer psychology and marketing research, where price labels or quality labels, for example, can affect decision-making processes and also experiences like taste pleasantness. And since we are in France, one of the most salient examples of these kinds of effects comes from wine tasting. And many studies have shown that the price of wine can influence how pleasant it tastes.

And we and other people have shown that this is mediated by activation in what is called the brain valuation system, regions that encode expected and experienced reward. And one of the most prominent hubs in this brain valuation system is the ventromedial prefrontal cortex, which you see here on the SPM on the slide, and which basically translates these price-label effects into taste pleasantness. What is also interesting is its sensitivity to monetary reward, for example obtaining a monetary reward by surprise: the vmPFC activates when you obtain such a reward unexpectedly.

And the participants who activate the vmPFC more for these kinds of positive surprises are also the participants in whom the vmPFC encoded more strongly the difference between expensive and cheap wines. That makes a nice parallel to what we know from placebo analgesia, where it has also been shown that the sensitivity of the brain's reward system can moderate placebo analgesia, with participants showing higher reward sensitivity in the ventral striatum, for example, another region of this system, showing stronger placebo analgesia.

So this is basically to let you appreciate that these effects parallel nicely what we know from placebo effects in pain and also in disease. So we went beyond just taste liking, which is basically experiencing a reward such as wine: could placebos also affect motivational processes per se, when we, for example, want something more?

And one way to study this is to study a basic motivation such as, for example, hunger. Eating behavior, for instance, has long been conceptualized as driven by homeostatic hormonal markers such as ghrelin and leptin that signal satiety and energy stores. And as a function of these different hormonal markers in our blood, we go and look for food and eat food. But we also know from the placebo effects on taste pleasantness that there is a possibility that our higher-order beliefs about our internal states, not just our hormones, can influence whether we want to eat food, whether we engage in these very basic motivations. And we tested that, as have other people, so this is a replication.

In this study we gave healthy participants, who came into the lab in a fasted state, a glass of water. And we told them, well, water sometimes can stimulate hunger by stimulating the receptors in your mouth. And sometimes you can also drink a glass of water to kill your hunger. And a third group, a control group, was given a glass of water and told it's just water; it does nothing to hunger. And then we asked them to rate how hungry they feel over the course of the experiment. And it's a three-hour experiment; everybody has fasted, and they have to do this food choice task in an fMRI scanner, so everybody gets hungry over these three hours.

But what was interesting, and what you see here on this rain cloud plot, is that participants who drank the water suggested to be a hunger killer increased their hunger ratings less than participants who believed the water would enhance their hunger. So this is a nice replication of what we already know from the field; other people have shown this, too.

And the interesting thing is that it also affected food wanting, this motivational process of how much you want to eat food. So when people lay there in the fMRI scanner, they saw different food items and were asked whether they wanted to eat each one or not, for real, at the end of the experiment. So it's incentive compatible. And what you see here is basically what we call stimulus value, so how much you want to eat this food.

And the hunger sensation ratings that I showed you before parallel what we find here. The people in the decreased-hunger suggestion group wanted to eat the food less than those in the increased-hunger suggestion group, showing that it is not only an effect on subjective self-reports of how you feel your body's hunger signals. It's also your actual preference, your subjective preference for food, that is influenced by the placebo manipulation. And it also influences how your brain valuation system encodes the value underlying your food preference. And that's what you see on this slide.

Here you see the ventromedial prefrontal cortex: the yellow voxels, and the more yellow they are, the more strongly they correlate with food wanting. And you see on the right side, with the temporal time courses of the vmPFC, that this food-wanting encoding is stronger when people were in the increased-hunger suggestion group than in the decreased-hunger suggestion group.

So basically what I've shown you here is three placebo effects: placebo effects on subjective hunger ratings, placebo effects on food choices, and placebo effects on how the brain encodes food preferences and food choices. And you could wonder, these are readouts, behavioral and neural readouts, but what is the mechanism behind them? Basically, what sits between the placebo intervention and the behavioral and neural readouts of this effect?

And one snippet of the answer to this question comes when you look at the expectation ratings. Expectations have long been shown to be one of the cognitive mediators of placebo effects across domains. And that's what we see here, too, especially in the hunger-killer suggestion group: the participants who believed more strongly that the drink would kill their hunger were also those whose hunger increased less over the course of the experiment.

And this moderated activity in the region that you see here, the medial prefrontal cortex, which basically activated when people saw food on the screen and thought about whether they wanted to eat it or not. And this region's activity was positively moderated by the strength of the expectancy that the glass of water would decrease their hunger. So the more you expect that the water will decrease your hunger, the more the mPFC activates when you see food on the screen.

It's an interesting brain region because it sits right between the ventromedial prefrontal cortex, which encodes the value, the food preference, and the dorsolateral prefrontal cortex. And it has been shown by past research to connect to the vmPFC when participants exert self-control, especially during food decision-making paradigms.

But another way to answer the question of how the placebo intervention can affect these behavioral and neural effects is to use computational modeling to better understand preference formation. And one way is drift diffusion modeling. These drift diffusion models come from perceptual research, for understanding perception, and they have recently also been used to better understand preference formation. And they assume that your preference for a yes or no food choice, for example, arises from a noisy accumulation of evidence.

And the two types of evidence you accumulate in these decision-making paradigms are basically how tasty and how healthy the food is: how much you like the taste, how much you consider the health. And this could influence the slope of your evidence accumulation, how rapidly you reach a threshold towards yes or no.

It could be that the placebo manipulation influences this slope. But the model lets us test several other hypotheses. It could be that the placebo intervention affected just the threshold, which reflects how carefully you make the decision towards a yes or no choice. It could be your initial bias, that is, whether you were initially biased towards a yes or a no response. Or it could be the non-decision time, which reflects more sensory-motor integration.
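For readers unfamiliar with drift diffusion models, here is a minimal, illustrative simulation of a single yes/no food choice under assumed parameter values; it is a sketch of the general technique, not the model actually fitted in the study.

```python
import numpy as np

# Minimal drift-diffusion sketch of a yes/no food choice.
# Parameter names and values are illustrative, not the fitted ones from the study.

def simulate_choice(taste, health, w_taste=1.0, w_health=0.5,
                    threshold=1.0, start_bias=0.0, ndt=0.3,
                    noise=1.0, dt=0.001, rng=None):
    """Simulate one yes/no food choice.
    Evidence drifts at a rate set by weighted taste + health ratings,
    starting from `start_bias`, until it hits +threshold (yes) or -threshold (no).
    Returns (choice, reaction_time_in_seconds)."""
    rng = rng or np.random.default_rng()
    drift = w_taste * taste + w_health * health   # slope of evidence accumulation
    evidence, t = start_bias, 0.0
    while abs(evidence) < threshold:
        evidence += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return ("yes" if evidence > 0 else "no", ndt + t)

print(simulate_choice(taste=0.8, health=-0.2, rng=np.random.default_rng(1)))
```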

And the answer to this question is basically that three parameters were influenced by the placebo manipulation: how much you integrated healthiness and tastiness, and your initial starting bias. So you paid more attention to healthiness when you believed you were on a hunger killer, and more to tastiness when you believed you were on a hunger enhancer. And similarly, participants were initially biased towards accepting food more when they believed they were on a hunger enhancer than on a hunger killer.

Interestingly, this basically shows that the decision-making process is biased by the placebo intervention, and so is how much you filter the information that is most relevant. When you are hungry, taste is very relevant for your choices. When you believe you are less hungry, you have more room, or you pay less attention to taste, and you can pay more attention to the healthiness of food.

And one analysis that suggests this might be a filtering of expectation-relevant information is a psychophysiological interaction analysis that looks at the brain activity in the vmPFC, our seed region: where in the brain does it connect when participants see food on a computer screen and have to think about whether they want to eat this food or not?

And what we observed is that it connects to the dlPFC, the dorsolateral prefrontal cortex. It's a region of interest that we localized first, in a separate Stroop localizer task, to be sure it is actually a region that activates during interference resolution, basically when we have to filter the information that is most relevant to a task.

So the vmPFC connects more strongly to this dlPFC interference-resolution region, and this is moderated, especially in the decreased-hunger suggestion group, by how much participants considered the healthiness against the tastiness of food.

To wrap this part up, we replicated findings from previous studies about appetitive placebo effects by showing that expectancies about the efficacy of a drink can affect hunger sensations, how participants form their food preferences and make food choices, and value encoding in the ventromedial prefrontal cortex.

But we also provided evidence for underlying neurocognitive mechanisms that involve the medial prefrontal cortex, which is moderated by the strength of the hunger expectation; that food choice formation is biased, in the form of an attention-filtering mechanism, toward expectancy-congruent information, that is, taste for the increased-hunger suggestion group and healthiness for the decreased-hunger suggestion group; and that this is implemented by regions linked to interference resolution but also to valuation and preference encoding.

And so why should we care? In the real world, it is not very relevant to provide people with deceptive information about hunger-influencing ingredients of drinks. But studies like this one provide insights into cognitive mechanisms of beliefs about internal states and how these beliefs can affect the interoceptive sensations and also associated motivations such as economic choices, for example.

And this can actually also give us insights into the synergy between drug experiences and outcome expectations, which could be harnessed, basically translated, via motivational processes, and through that maybe lead us to better understand susceptibility to active treatment.

And I'm going to elaborate on this in the next part of the talk, where I'm going a little bit further afield; I'm not talking about or showing evidence about placebo effects per se. But before that, yes, so basically it is.

Links to these motivational processes have long been suggested to also be part of the mechanisms of placebo effects. And that is called the placebo-reward hypothesis. It's based on findings in Parkinson's disease showing that when you give Parkinson's patients a placebo but tell them it's a dopaminergic drug, you can measure dopamine in the brain, and the marker for dopamine, its binding potential, decreases. That is what you see here in these PET scan results.

That suggests the brain must have released endogenous dopamine. Dopamine is very important for expectations and for learning from reward, and clinical benefit is the kind of reward that patients expect. So it is possible that when a patient expects clinical benefit as a reward, their brain releases dopamine in reward-related regions such as the vmPFC or the ventral striatum.

We have shown in the past that the behavioral consequence of such dopamine release under placebo can indeed be linked to reward learning. We know, for example, that Parkinson's patients have a deficit in learning from reward when they are off dopaminergic medication, and that this normalizes when they are on active dopaminergic medication.

So we wondered: if, based on these PET studies, the brain releases dopamine under placebo, does this also have behavioral consequences for reward learning ability? That is what you see on the right side of the screen: Parkinson's patients tested on placebo show reward learning abilities similar to those under active drug.

This was again underpinned by increased correlation of ventromedial prefrontal cortex activity, this hub of the brain valuation system, with the learned reward value, a correlation that was stronger in the placebo and active-drug conditions compared to the off-drug baseline condition.

I now want to make a link to another disorder in which motivation is deficient, which is depression. Depression is hypothesized to be maintained by a triad of very negative beliefs about the world, the future, and oneself, which is very insensitive to belief-disconfirming information, especially when that information is positive, or favorable. Cognitive neuroscience studies have shown this is reflected in a reduced good news/bad news bias, or optimism bias, in belief updating in depression. The good news/bad news bias is a bias healthy people have to weight favorable information that contradicts an initial negative belief more strongly than unfavorable information.

This is considered healthy because it protects against constant reversal of beliefs, and it also involves a motivational process, because good news has motivational salience: it should be more motivating to update beliefs about the future, especially negative ones, when we learn that our beliefs were too negative and receive information that disconfirms that initial belief. But depressed patients lack this good news/bad news bias. So we wondered what happens when patients respond to antidepressant treatments that give immediate sensory evidence of being on an antidepressant.

New fast-acting antidepressants such as ketamine are treatments where patients know right away, through the dissociative experiences, whether they got the drug. So could this drug experience affect the cognitive model of depression? That was the main question of the study. We then wondered what the computational mechanism is: is it linked, as in the previous studies, to reward learning mechanisms, that is, to biased updating of beliefs? And is it linked to clinical antidepressant effects and potentially to outcome expectations, which makes the link to placebo effects?

Patients performed a belief-updating task three times: before receiving ketamine infusions, after the first infusion, and one week after the third infusion. At each testing time we measured depression with the Montgomery-Asberg Depression Rating Scale. In the belief-updating task, patients were presented with different negative life events, for example getting a disease or, as shown here, losing a wallet.

They were asked to estimate their probability of experiencing this life event in the near future, were then presented with evidence about how frequently this event occurs in the general population, what we call the base rate, and then had the possibility to update their belief, now knowing the base rate.

This, for example, is a good news trial, where participants initially overestimated the chance of losing a wallet, learn that it is much less frequent than they thought, and update their estimate by, say, 15 percent. In a bad news trial, you initially underestimated your probability of experiencing the adverse life event, and if you have a good news/bad news bias you will take this information into account to a lesser degree than in a good news trial.

That is exactly what happens in the healthy controls, shown on the leftmost part of the screen. Belief updating is on the Y axis, for healthy controls age-matched to the patients, and you can see updating after good news and updating after bad news. We tested participants multiple times within a week, and there is a bias that decreases a little with repeated testing in the healthy controls. Importantly, in the patients the bias is weak before ketamine treatment.

But it becomes much stronger after ketamine treatment; it emerges, basically. So patients become more optimistically biased after ketamine treatment, and this correlates with the MADRS scores: patients who improve more with treatment are also those who show a stronger good news/bad news bias after one week of treatment.

We again wondered about the computational mechanisms. One way to get at this is to use a Rescorla-Wagner-type reinforcement learning model, which assumes that updating is proportional to your surprise, which is called the estimation error.

The estimation error is the difference between the initial estimate and the base rate, and it is weighted by a learning rate. The important thing is that the learning rate has two components, a scaling parameter and an asymmetry parameter, and the asymmetry parameter captures how much the learning rate differs after good news, that is after positive estimation errors, compared to after negative estimation errors.
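As a worked illustration of the asymmetric updating rule just described, here is a minimal sketch; the parameter names and values are illustrative assumptions, not the authors' fitted model.

```python
def update_belief(initial_estimate, base_rate, scale, asymmetry):
    """One belief-updating step with an asymmetric Rescorla-Wagner-style rule.

    initial_estimate, base_rate: probabilities in percent (0-100).
    scale: overall learning-rate scaling parameter.
    asymmetry: > 0 boosts learning from good news (event less likely than feared);
               < 0 boosts learning from bad news. Illustrative notation only.
    """
    estimation_error = base_rate - initial_estimate          # surprise
    good_news = estimation_error < 0                          # base rate lower than feared
    learning_rate = scale * ((1 + asymmetry) if good_news else (1 - asymmetry))
    return initial_estimate + learning_rate * estimation_error

# Good news trial: participant feared a 40% chance of losing a wallet, base rate is 20%.
print(update_belief(40, 20, scale=0.5, asymmetry=0.4))   # large downward update (to 26%)
# Bad news trial: participant estimated 10%, base rate is 30%.
print(update_belief(10, 30, scale=0.5, asymmetry=0.4))   # smaller upward update (to 16%)
```

With a positive asymmetry, the same 20-point estimation error produces a larger update after good news than after bad news, which is the signature of the good news/bad news bias.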

What we see is that in healthy controls the learning rate is stronger for positive estimation errors than for negative ones, which translates into the good news/bad news bias; it is basically an asymmetric learning mechanism. In the patients, learning is non-asymmetric before ketamine treatment and becomes asymmetric, as reflected in the learning rates, after ketamine treatment.

What we take from this is that ketamine induced an optimism bias. But an interesting question is what comes first: is it the improvement in depression that we measured with the Montgomery-Asberg Depression Rating Scale, or is it the optimism bias that emerged and triggered the improvement? Since it is a correlation, we do not know.

An interesting aside that we put in the supplement was that in 16 patients, a very small sample, the expectancy of getting better also correlated with clinical improvement after ketamine treatment. We had two expectancy ratings: one about the efficacy of ketamine and one about how intense patients expected their depression to be after ketamine treatment.

That suggested that expectations of clinical benefit interact, in part or synergistically, with the drug experience that generates the optimism bias. To test this further, we continued data collection on the expectancy ratings alone and asked how clinical improvement after the first infusion links to clinical improvement after the third infusion.

We know from this that patients who improve after the first infusion are also those who improve after the third infusion. But is this mediated by their expectancy about the ketamine treatment? That is indeed what we found: the more patients expected to get better, the more they improved after one week of treatment, and this expectancy mediated the link between the first drug experience and the later drug response. This suggests the effect may not be additive but, as other panelists have already put forward today, a synergistic link.
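For readers who want the mediation logic spelled out, here is a minimal sketch using simulated data; the variable names and effect sizes are hypothetical, and a real analysis would assess the indirect path with bootstrapping.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the real measurements (hypothetical column names):
rng = np.random.default_rng(0)
n = 60
improve_1st = rng.normal(size=n)                                     # improvement after 1st infusion
expectancy = 0.6 * improve_1st + rng.normal(scale=0.8, size=n)       # expectancy rating
improve_3rd = 0.5 * expectancy + 0.1 * improve_1st + rng.normal(scale=0.8, size=n)
df = pd.DataFrame(dict(improve_1st=improve_1st, expectancy=expectancy, improve_3rd=improve_3rd))

path_a = smf.ols("expectancy ~ improve_1st", data=df).fit()                # X -> M
path_b = smf.ols("improve_3rd ~ expectancy + improve_1st", data=df).fit()  # M + X -> Y
total = smf.ols("improve_3rd ~ improve_1st", data=df).fit()                # X -> Y

indirect = path_a.params["improve_1st"] * path_b.params["expectancy"]
print("total:", total.params["improve_1st"],
      "direct:", path_b.params["improve_1st"],
      "indirect (a*b):", indirect)
```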

One way to get at these synergies is again to use computational models. The idea, which also came up yesterday, is that self-fulfilling prophecies could contribute to treatment responsiveness and treatment adherence. These self-fulfilling prophecies are asymmetrically biased learning mechanisms that become more biased when you have positive initial treatment experiences, and they may then contribute to how you adhere to the treatment in the long term and how much you benefit from it in the long term. So it involves both drug experience and expectancy.

This is unpublished work where we played with this idea using a reinforcement learning model. It is very much inspired by what we know from placebo analgesia: Tor and Leonie Koban have a paper showing that self-fulfilling prophecies can be captured with biased reinforcement learning models. The idea of these models is that there are two learning rates, alpha plus and alpha minus, which weigh differently into the updating of your expectation after a drug experience.

LIANE SCHMIDT: Okay, yeah, I'm almost done.

The two learning rates weigh drug experiences into expectations differently, as a function of whether the experience was congruent with your expectations, that is, a positive experience versus a negative one. Here are some simulations of this model, showing that your expectation gets updated more when you are positively biased than when you are negatively biased, and these are some predictions of the model concerning depression improvement.
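A minimal simulation sketch of this two-learning-rate idea is shown below; the parameter values and the sequence of drug experiences are illustrative assumptions, not the unpublished model itself.

```python
def simulate_expectation(alpha_plus, alpha_minus, experiences, expectation=0.5):
    """Update a treatment expectation (0-1) after each drug experience (0-1)."""
    trajectory = [expectation]
    for experience in experiences:
        prediction_error = experience - expectation
        # Positively biased learners weigh expectation-exceeding experiences more heavily.
        alpha = alpha_plus if prediction_error > 0 else alpha_minus
        expectation += alpha * prediction_error
        trajectory.append(expectation)
    return trajectory

experiences = [0.8, 0.7, 0.8]  # three hypothetical positive infusion experiences
print(simulate_expectation(alpha_plus=0.7, alpha_minus=0.2, experiences=experiences))  # optimistic learner
print(simulate_expectation(alpha_plus=0.2, alpha_minus=0.7, experiences=experiences))  # pessimistic learner
```

An optimistically biased learner (large alpha plus) converges quickly toward a high treatment expectation after positive infusion experiences, while a pessimistically biased learner barely moves, which is the self-fulfilling-prophecy dynamic described above.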

To wrap this up, the conclusion is that asymmetric learning can capture self-fulfilling prophecies and could be a mechanism that translates expectations and drug experiences across domains, from placebo hypoalgesia to antidepressant treatment responsiveness. The open question, obviously, is to challenge the predictions of these models with more empirical data, in pain but also in mood disorders, as Marta does and as we are currently doing at the Pitié-Salpêtrière, where we test the mechanisms of belief-updating biases in depression with fMRI and these mathematical models.

This has a direct clinical implication, because it could help us better understand how these fast-acting antidepressants work and what makes patients adhere and respond to them. Thank you for your attention. We are the Control-Interoception-Attention team, and thank you to all the funders.

CRISTINA CUSIN: Fantastic presentation. Thank you so much. Without further ado, let's move on to the next speaker. Dr. Greg Corder.

GREG CORDER: Did that work? Is it showing?

GREG CORDER: Awesome, all right. One second. Let me just move this other screen. Perfect. All right.

Hi, everyone. My name is Greg Corder. I'm an Assistant Professor at the University of Pennsylvania. I guess I get to be the final scientific speaker in this session over what has been an amazing two-day event. So thank you to the organizers for also having me get the honor of representing the entire field of preclinical placebo research as well.

And so I'm going to give a bit of an overview of work from some of my friends and colleagues over the last few years, and then tell you a bit about how we're leveraging a lot of current neuroscience technologies to identify the cell types and circuits, building from the human fMRI literature that has really honed in on these key circuits for expectations and belief systems as well as endogenous antinociceptive systems, in particular opioid cell types.

So the work I'm going to show from my lab has really been driven by these two amazing scientists. Dr. Blake Kimmey, an amazing post-doc in the lab. As well as Lindsay Ejoh, who recently last week just received her D-SPAN F99/K00 on placebo circuitry. And we think this might be one of the first NIH-funded animal projects on placebo. So congratulations, Lindsay, if you are listening.

Okay. So why use animals, right? We've heard an amazing set of stories really nailing down the specific circuits in humans, leveraging MRI, fMRI, EEG and PET imaging, that give us a really nice roadmap and idea of how beliefs about analgesia might be encoded within different brain circuits and how those might change over time with different types of modeling or updating of experiences.

And we love this literature; in the lab we read it in as much depth as we can. We use it as a roadmap for our animal studies, because animal models allow us to dive deep into very specific circuits using techniques like those on the screen here, from RNA sequencing to electrophysiology, showing that the functional connections measured with fMRI really do exist as axons projecting from one region to another.

And then we can manipulate those connections and projections using things like optogenetics and chemogenetics that allow us really tight temporal coupling to turn cells on and off. And we can see the effects of that intervention in real time on animal behavior. And that's really the tricky part is we don't get to ask the animals do you feel pain? Do you feel less pain? It's hard to give verbal suggestions to animals.

And so we have to rely on a lot of different tricks and really get into the heads of what it's like to be a small prey animal existing in a world with a lot of large monster human beings around them. So we have to be very careful about how we design our experiments. And it's hard; placebo in animals is not an easy subject to get into. This is reflected in the fact that, as far as we can tell, there are only 24 published studies to date on placebo analgesia in animal models.

However, I think this is an excellent opportunity to take advantage of what has been a golden age of neuroscience technologies exploding in the last 10-15 years, and to revisit a lot of these open questions: when are opioids released, and are they released at all? Can animals have expectations? Can they have something like a belief structure, and violations of those expectations that lead to different types of prediction errors encoded in different neural circuits? We have a chance to really do that.

But I think the most critical first thing is how do we begin to behaviorally model placebo in these preclinical models. So I want to touch on a couple of things from some of my colleagues. So on the left here, this is a graph that has been shown on several different presentations over the past two days from Benedetti using these tourniquet pain models where you can provide pharmacological conditioning with an analgesic drug like morphine to increase this pain tolerance.

And then if it is covertly switched out for saline, you can see that there is an elevation in that pain tolerance, reflective of something like a placebo analgesic response. And this is sensitive to naloxone, the mu opioid receptor antagonist, suggesting endogenous opioids are indeed involved in this type of placebo-like response.

And my colleague, Dr. Matt Banghart, at UCSD has basically done a fantastic job of recapitulating this exact model in mice where you can basically use morphine and other analgesics to condition them. And so if I just kind of dive in a little bit into Matt's model here.

You can have a mouse that will sit on a noxious hot plate. You know, it's an environment that's unpleasant. You can have contextual cues like different types of patterns on the wall. And you can test the pain behavior responses like how much does the animal flick and flinch and lick and bite and protect itself to the noxious hot plate.

And then you can switch the contextual cues, provide an analgesic drug like morphine, see reductions in those pain behaviors. And then do the same thing in the Benedetti studies, you switch out the morphine for saline, but you keep the contextual cues. So the animal has effectively created a belief that when I am in this environment, when I'm in this doctor's office, I'm going to receive something that is going to reduce my perceptions of pain.

And, indeed, Matt sees a quite robust effect here, where this sort of placebo response shows an elevated paw withdrawal latency, indicating that there is endogenous antinociception occurring with this protocol. And it happens pretty robustly; most of the animals going through this conditioning protocol demonstrate this type of antinociceptive behavioral response. This is a perfect example of how we can leverage what we learn from human studies in rodent studies of acute pain.

And this is also really great for probing the effects of placebo in chronic neuropathic pain models. This is work from Dr. Damien Boorman, who was with Professor Kevin Keay in Australia and is now with Loren Martin in Toronto.

And here Damien really amped up the contextual cues here. So this is an animal who has had an injury to the sciatic nerve with this chronic constriction injury. So now this animal is experiencing something like a tonic chronic neuropathic pain state. And then once you let the pain develop, you can have the animals enter into this sort of placebo pharmacological conditioning paradigm where animals will go onto these thermal plates, either hot or cool, in these rooms that have a large amount of visual tactile as well as odorant cues. And they are paired with either morphine or a controlled saline.

Again, the morphine is switched for saline on that last day. And what Damien has observed is that in a subset of the animals, about 30%, you can have these responder populations that show decreased pain behavior which we interpret as something like analgesia overall. So overall you can use these types of pharmacological conditionings for both acute and chronic pain.

So now what we're going to do in our lab is a bit different. And I'm really curious to hear the field's thoughts because all -- everything I'm about to show is completely unpublished. Here we're going to use an experimenter-free, drug-free paradigm of instrumental conditioning to instill something like a placebo effect.

And so this is what Blake and Lindsay have been working on since about 2020. And this is our setup in one of our behavior rooms here. Our apparatus is this tiny little device down here. And everything else are all the computers and optogenetics and calcium imaging techniques that we use to record the activity of what's going on inside the mouse's brain.

But simply, this is just two hot plates that we can control the temperature of. And we allow a mouse to freely explore this apparatus. And we can with a series of cameras and tracking devices plot the place preference of an animal within the apparatus. And we can also record with high speed videography these highly conserved sort of protective recuperative pain-like behaviors that we think are indicative of the negative affect of pain.

So let me walk you through our little model here real quick. Okay. So we call this the placebo analgesia conditioning assay or PAC assay. So here is our two-plate apparatus here. So plate number one, plate number two. And the animal can always explore whichever plate it wants. It's never restricted to one side. And so we have a habituation day, let the animal familiarize itself. Like oh, this is a nice office, I don't know what's about to happen.

And then we have a pretest. And in this pretest, importantly, we make both of these plates, both environments a noxious 45-degree centigrade. So this will allow the animal to form an initial expectation that the entire environment is noxious and it's going to hurt. So both sides are noxious. Then for our conditioning, this is where we actually make one side of the chamber non-noxious. So it's just room temperature. But we keep one side noxious. So now there is a new expectation for the animal that it learns that it can instrumentally move its body from one side to the other side to avoid and escape feeling pain.

And so we'll do this over three days, twice per day. Then on our post-test or placebo day we make both environments hot again. So now we'll start the animal off over here, and the animals get to freely choose: do they want to go to the side that they expect should be non-noxious? So what happens?

If you just look at the place preference, over the course of conditioning we can see that the animals will, unsurprisingly, choose the non-noxious environment and spend essentially 100% of their time there. But when we flip the conditions such that everything is noxious on the post-test day, the animals will still spend a significant amount of time on the expected-analgesia side. I'm going to show you some videos now, and you are all going to become mouse pain behavior experts by the end of this.
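Before the videos, here is a minimal sketch of how such place preference could be quantified from tracked positions; the coordinate convention, frame rate, and plate boundary are hypothetical assumptions, not the lab's actual analysis code.

```python
import numpy as np

def preference_index(x_positions, boundary_cm=0.0):
    """Fraction of frames spent on the expected-analgesia (placebo) side, defined here as x < boundary."""
    x = np.asarray(x_positions, dtype=float)
    return np.mean(x < boundary_cm)

def time_on_placebo_side_s(x_positions, fps=30, boundary_cm=0.0):
    """Seconds spent on the placebo side, given a camera frame rate in frames per second."""
    return preference_index(x_positions, boundary_cm) * len(x_positions) / fps

# Example with 3 minutes of fake tracking data biased toward the placebo side:
rng = np.random.default_rng(0)
x_track = rng.normal(loc=-2.0, scale=3.0, size=3 * 60 * 30)
print(preference_index(x_track), time_on_placebo_side_s(x_track))
```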

So what I'm going to show you are both side by side examples of conditioned and unconditioned animals. And try to follow along with me as you can see what the effect looks like. So on this post test day. Oh, gosh, let's see if this is going to -- here we go. All right. So on the top we have the control animal running back and forth. The bottom is our conditioned animal.

And you'll notice we start the animal over here and it's going to go to the side that it expects it to not hurt. Notice the posture of the animals. This animal is sitting very calm. It's putting its entire body down on the hot plate. This animal, posture up, tail up. It's running around a little bit frantically. You'll notice it start to lick and bite and shake its paws. This animal down here might have a couple of flinches so it's letting you know that some nociception is getting into the nervous system overall.

But over the course of this three-minute test, the animals will rightly choose to spend more time over here. And if we start to quantify these types of behaviors that the animals are doing in both conditions, what we find is that there is actually a pretty significant reduction in these nociceptive behaviors. But it's not across the entire duration of this placebo day or post test day.

So this trial is three minutes long. And what we see is that this antinociceptive and preference choice only exists for about the first 90 seconds of this assay. So this is when the video I just showed, the animal goes to the placebo side, it spends a lot of its time there, does not seem to be displaying pain-like behaviors.

And then around 90 seconds, the animal -- it's like -- it's almost like the belief or the expectation breaks. And at some point, the animal realizes oh, no, this is actually quite hot. It starts to then run around and starts to show some of the more typical nociceptive-like behaviors. And we really like this design because this is really, really amenable to doing different types of calcium imaging, electrophysiology, optogenetics because now we have a really tight timeline that we can observe the changing of neural dynamics at speeds that we can correlate with some type of behavior.

Okay. So what are those circuits that we're interested in overall that could be related to this form of placebo? Again, we like to use the human findings as a wonderful roadmap. And Tor has demonstrated, and many other people have demonstrated this interconnected distributed network involving prefrontal cortex, nucleus accumbens, insula, thalamus, as well as the periaqueductal gray.

And so today I'm going to talk about just the periaqueductal gray. Because there is evidence that there is also release of endogenous opioids within this system here. And so we tend to think that the placebo process and the encoding, whatever that is, the placebo itself is likely not encoded in the PAG. The PAG is kind of the end of the road. It's the thing that gets turned on during placebo and we think is driving the antinociceptive or analgesic effects of the placebo itself.

So the PAG, for anyone who's not as familiar, we like it because it's conserved across species. We look at in a mouse. There's one in a human. So potentially it's really good for translational studies as well. It has a very storied past where it's been demonstrated that the PAG subarchitecture has these beautiful anterior to posterior columns that if you electrically stimulate different parts of PAG, you can produce active versus passive coping mechanisms as well as analgesia that's dependent on opioids as well as endocannabinoids.

And then the PAG is highly connected. Both from ascending nociception from the spinal cord as well as descending control systems from prefrontal cortex as well as the amygdala. So with regard to opioid analgesia. If you micro infuse morphine into the posterior part of the PAG, you can produce an analgesic effect in rodents that is across the entire body. So it's super robust analgesia from this very specific part of the PAG.

If you look at the PAG back there and you do some of these techniques to look for histological indications that the mu opioid receptor is there, it is indeed there. There is a large amount of mu opioid receptors, it's OPRM1. And it's largely on glutamatergic neurons. So the excitatory cells, not the inhibitory cells. They are on some of them.

And as far as the electrophysiology data goes, we can see that the mu opioid receptor is functional there: with DAMGO, a mu opioid receptor agonist, we can see activation of inhibitory GIRK currents in those cells. So the system is wired up for placebo analgesia to happen in that location. Okay. So how are we actually going to start to tease this out? By finding these cells, seeing where they project throughout the brain, and then understanding their dynamics during placebo analgesia.

So last year we teamed up with Karl Deisseroth's lab at Stanford to develop a new toolkit that leverages the genetics of the opioid system, in particular the promoter for the mu opioid receptor. And we were able to take the genetic sequence for this promoter and package it into adeno associated viruses along with a range of different tools that allow us to turn on or turn off cells or record their activity. And so we can use this mu opioid receptor promoter to gain genetic access throughout the brain or the nervous system for where the mu opioid receptors are. And we can do so with high fidelity.

This is just an example of our mu opioid virus in the central amygdala, which is a highly mu opioid specific area. Blake used this tool, using the promoter to drive a range of different transgenes within the periaqueductal gray. And right here, this is GCaMP, a calcium indicator that allows us to assess in real time the calcium activity of PAG mu opioid cells.
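For context, here is a minimal sketch of a standard fiber-photometry ΔF/F computation for a GCaMP signal like the one described; the sampling rate, baseline window, and percentile choice are illustrative assumptions, not the lab's acquisition or analysis settings.

```python
import numpy as np

def delta_f_over_f(fluorescence, fs=20.0, baseline_window_s=30.0):
    """Compute ΔF/F using a running low-percentile baseline as F0."""
    f = np.asarray(fluorescence, dtype=float)
    win = int(baseline_window_s * fs)
    baseline = np.array([
        np.percentile(f[max(0, i - win):i + 1], 10)   # 10th percentile of the trailing window
        for i in range(len(f))
    ])
    return (f - baseline) / baseline

# Example: fake 60-second trace at 20 Hz with a calcium transient around t = 30 s.
t = np.arange(0, 60, 1 / 20)
trace = 100 + 5 * np.exp(-((t - 30) ** 2) / 2)
print(delta_f_over_f(trace).max())
```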

And so what Blake did was he took a mouse, and he recorded the nociceptive responses within that cell type and found that the mu opioid cell types are actually nociceptive. They respond to pain, and they do so with increasing activity to stronger and stronger and more salient and intense noxious stimuli. So these cells are actually nociceptive.

And if we look at a ramping hot plate, we can see that those same mu opioid cell types in the PAG increase the activity as this temperature on this hot plate increases. Those cells can decrease that activity if we infuse morphine.

Unsurprisingly, they express the mu opioid receptor and they're indeed sensitive to morphine. If we give naltrexone to block the mu opioid receptors, we can see greater activity to the noxious stimuli, suggesting that there could be an opioid tone or some type of an endogenous opioid system that's keeping this system in check, that it's repressing its activity. So when we block it, we actually enhance that activity. So it's going to be really important here. The activity of these mu opioid PAG cells correlates with affective measures of pain.

When animals are licking, shaking, biting, when it wants to escape away from noxious stimuli, that's when we see activity within those cells. So this is just correlating different types of behavior when we see peak amplitudes within those cell types. So let me skip that real quick.

Okay. So we have this ability to peek into the activity of mu opioid cell types. Let's go back to that placebo assay, our PAC assay I mentioned before. If we record from the PAG on that post-test day in an animal that has not undergone conditioning, when the plates are super hot, we see a lot of nociceptive activity in these cells; they're bouncing up and down.

But if we look at the activity of these cells in an animal undergoing placebo, what we see is a suppression of neural activity within that first 90 seconds. And this actually does seem to extinguish within the latter 90 seconds, so it roughly tracks the behavior of those animals: when they're showing antinociceptive behavior, that's when those cells are quiet.

When the pain behavior comes back, that's when those cell types are ramping up. But what about the opioids themselves? The mu opioid receptor cell types decrease their activity, but what about the opioid peptides? The way to measure this in animals has been microdialysis, a fantastic technique, but it has some limitations: it samples peptides and then uses liquid chromatography to tell whether the peptide was present, and the sampling rate is about 10 minutes.

And in terms of brain processing, 10 minutes might as well be an eternity when we're talking about milliseconds. But we want to know what these cells here, these red dots, are doing; these are the enkephalinergic cells in the PAG. We needed a revolution in technologies. One of those came several years ago from Dr. Lin Tian, who developed some of the first sensors for dopamine. Some of you may have heard of it; it's called dLight.

This is a version of dLight, but it's actually an enkephalin opioid sensor. What Lin did to genetically engineer this was to take the delta opioid receptor, which is highly selective for enkephalin, and link it with this GFP molecule, such that when enkephalin binds to the sensor it fluoresces.

We can capture that fluorescence with microscopes that we implant over the PAG, and we can see when enkephalin is being released with subsecond resolution. What we wanted to see is whether enkephalin is indeed being released onto those mu opioid receptor expressing, pain-encoding neurons in the PAG. What I showed you before is that those PAG neurons ramp up their activity as nociception increases while a mouse is standing on a hot plate. So we see nociception ramp up. What do you all think happened with the opioids?

It wasn't what we expected. It actually drops. So what we can tell is that there is a basal opioid tone within the PAG, but that as acute nociception increases, we see a decrease, a suppression, of opioid peptide release.

We think this has to do with stuff that Tor has published on previously that the PAG is more likely involved in updating prediction errors. And this acute pain phenomenon we think is reflective of the need to experience pain to update your priors about feeling pain and to bias the selection of the appropriate behaviors, like affect related things to avoid pain. However, what happens in our placebo assay?

We actually see the opposite. If we condition animals to expect pain relief within that PAC assay, we see an increase in the signal from the enkephalin sensor, suggesting that there is an increase in enkephalin release post-conditioning. So there can be differential control of the opioid system within this brain region. This next part is the fun thing you can do with animals: what if we just bypassed the need to do the placebo assay?

If we know that we just need to cause release of enkephalin within the PAG to produce pain relief, we could do that directly with optogenetics. So we used a mouse line that allows us to put a red-light-sensitive opsin protein into the enkephalinergic interneurons of the PAG.

When we shine red light on top of these cells, they turn on and start to release their neurotransmitters. These cells are GABAergic and enkephalinergic, so they're dumping out GABA and now dumping out enkephalin into the PAG. We can visualize that using the delta-light sensor from Lin Tian.

So here is an example of optogenetically released enkephalin within the PAG over 10 minutes. The weird thing that we still don't fully understand is that this signal continues after the optogenetic stimulation. So can we harness the placebo effect in mice? At least it seems we can. If we turn on these cells strongly, cause them to release enkephalin, and put animals back on these ramping hot plate tests, we don't see any changes in the latency to detect pain, but we see a specific ablation of, or reduction in, these affective-motivational pain-like behaviors.

MODERATOR: You have one minute remaining.

GREGORY CORDER: Cool. In this last minute: people are skeptical, can we actually test these higher-order cognitive processes in animals? For anyone who is not a preclinical behavioral neuroscientist, you might not be aware that there is an absolute revolution happening in behavior with the use of deep learning models that can precisely and accurately quantify animal behavior. This is an example of a deep-learning tracking system.

We've built the Light Automated Pain Evaluator, which can capture a range of different pain-related behaviors, fully automated without human intervention whatsoever, and which can be paired with brain recording techniques like calcium imaging. That allows us to fit a lot of different computational models to understand what the activity of single neurons might be doing, let's say in the cingulate cortex, that might be driving that placebo response.
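To give a flavor of how pose-estimation output can be turned into automated behavior scores, here is a hypothetical sketch; the keypoint names, distance threshold, and frame rate are assumptions for illustration and do not describe the Light Automated Pain Evaluator itself.

```python
import numpy as np

def detect_paw_to_mouth_bouts(snout_xy, hindpaw_xy, fps=100, dist_thresh=5.0, min_frames=10):
    """Return (start, end) frame indices where the hindpaw stays near the snout (licking/biting)."""
    dist = np.linalg.norm(np.asarray(snout_xy) - np.asarray(hindpaw_xy), axis=1)
    close = dist < dist_thresh
    bouts, start = [], None
    for i, flag in enumerate(close):
        if flag and start is None:
            start = i                                   # bout begins
        elif not flag and start is not None:
            if i - start >= min_frames:
                bouts.append((start, i))                # keep bouts longer than min_frames
            start = None
    if start is not None and len(close) - start >= min_frames:
        bouts.append((start, len(close)))
    return bouts

# Hypothetical usage with fake coordinates for 1,000 frames:
rng = np.random.default_rng(0)
snout = rng.uniform(0, 50, size=(1000, 2))
paw = snout + rng.normal(scale=3.0, size=(1000, 2))     # paw wandering near the snout
print(detect_paw_to_mouth_bouts(snout, paw))
```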

We can now really start to tie, at single-cell resolution, the activity of prefrontal cortex to these placebo effects and see if that alters antinociceptive behavior endogenously. I'll stop there and thank all the amazing people, Blake, Greg, and Lindsay, who did this work, as well as all of our funders and the numerous collaborators who have helped us do this. So thank you.

CRISTINA CUSIN: Terrific talk. Thank you so much. We're blown away. I'll leave the discussion to our two moderators. They're going to gather some of the questions from the chat and some of their own questions for all the presenters from today and from yesterday as well.

TED KAPTCHUK: Matt, you start gathering questions. I got permission to say a few moments of comments. I wanted to say this is fantastic. I actually learned an amazing amount of things. The amount of light that was brought forward about what we know about placebos and how we can possibly control placebo effects, how we can possibly harness placebo effects.

There was so much light and new information. What I want to do in my four minutes of comments is look to the future. What I mean by that is -- I want to give my comments and you can take them or leave them but I've got a few minutes.

What I want to say is that we got the light, but we didn't put it together. There's no way we could have; we needed to be more in the same room, asking how does this fit in with your model? It's hard to do. By putting things together, I mean, for example, how do we control placebo effects in clinical trials. I not infrequently get asked by the pharmaceutical industry: when you look at our placebo data, we just blew it, placebo was as good as, or almost as good as, the drug.

And the first thing I say is I want to talk to experts in that disease. I want to know the natural history. I want to know how you made your entry criteria so I can understand regression to the mean.

I want to know what's the relationship of the objective markers and subjective markers so I can begin to think about how much is the placebo response. I always tell them I don't know. If I knew how to reduce -- increase the difference between drug and placebo I'd be a rich man, I wouldn't be an academic. What I usually wind up saying is, get a new drug. And they pay me pretty well for that. And the reason is that they don't know anything about natural history. We're trying to harness something, and I just want to say -- I've done a lot of natural history controls, and that's more interesting than the rest of the experiments because they're unbelievable, the amount of improvement people show entering the trial without any treatment.

I just want to say we need to look at other things besides the placebo effect if we want to control the placebo response in a randomized controlled trial. I want to say that going forward. But I also want to say that we need a little bit of darkness. We need to be able to say, you know, I disagree with you, I think this other data says something else. One of the things I've learned doing placebo research is that there is quickly a paper that contradicts your paper, and there is lots of contradictory information. It's very easy to say you're wrong, and we don't say it enough.

I want to take one example, please forgive me, and I know it could be said of my own research, Ted, you're wrong. But I just want to say something. Consistently in these two days of talks, everyone talks about the increase in the placebo response over time. No one refers to the article published in 2022 in BMJ, first author Mark Stone and senior author Irving Kirsch. Mark Stone is in the Division of Psychiatry at CDER at the FDA, and they analyzed all FDA data from placebo-controlled trials in major depressive disorder: over 230 trials, way more than 70,000 patients. They analyzed the trend over time, from 1979 to the publication. There was no increase in the placebo response.

Are they right, or are other people right? Nothing is one hundred percent clear right now, and we need to be able to contradict each other when we get together in person and say, I don't think that's right, or maybe that's right. I think that would help us. The next thing I want to say is that some things were missing from the conference that we need to include in the future. We need to have ethics. Placebo is about ethics. If you're a placebo researcher or you run placebo-controlled trials, that's an important question:

What are we talking about in terms of compromising ethics? We didn't have time for that discussion, but in the future, let's do that.

And the last thing I would say is that we need to ask patients what their experience is. I've been around for a long time, but the first time I started asking patients what their experiences were, whether they were on double-blind placebo or open-label placebo, I did it well after they finished the trial; the trial was over, and I actually took notes and went back and talked to people. They told me things I didn't even know about. We need to have that in conferences. What I want to say, along those lines, is that I feel so much healthier because I'm an older person and this crowd here is significantly younger than me.

Maybe Matt and I are the same age, I don't know, but I think this is really one of the best conferences I ever went to. It was real clear data. We need to do lots of other things in the future. So with that, Matt, feed me some questions.

MATTHEW RUDORFER: Okay. Thanks. I didn't realize you were also 35. But okay. [LAUGHTER].

MATTHEW RUDORFER: I'll start off with a question of mine. The recent emergence of intravenous ketamine for resistant depression has introduced an interesting methodologic approach that we have not seen in a long time, and that is the active placebo. Where the early trials just used saline, more recently we have seen the benzodiazepine midazolam, which, while not mimicking the full dissociative effect that many people get from ketamine, gives people something to feel, some kind of buzz, so that they might believe they're on an active compound and not just saline. I wonder if the panel has any thoughts about the merits of using an active placebo and whether that is something the field should be looking into more.

TED KAPTCHUK: I'm going to say something. Irving Kirsch published a meta-analysis of studies that used atropine as an active control in depression trials. He felt that it made it difficult to detect a placebo-drug difference, but another meta-analysis said that was not true. That approach was common in the '80s, and people started thinking about it then. But I have no idea how to answer your question.

MICHAEL DETKE: I think that's a great question. In the presentations yesterday about devices, Dr. Lisanby was talking about the ideal sham, and I think it's very similar: the ideal active placebo would have none of the efficacy of the drug in question but would have exactly the same side effects and all other features. Of course that's attractive, and of course we will probably never have a compound that's exactly like that. I think midazolam was a great thing to try with ketamine; it's still not exactly the same. But I'd also add that it's not black and white. It's not like we need to do this with ketamine and ignore it for all of our other drugs. All of our drugs have side effects.

Arguably, if you look at big classes of relatively modern drugs, antidepressants, antipsychotics and psychostimulants, those are, in increasing order of effect size in clinical trials, antidepressants, then antipsychotics, then psychostimulants. And they're also roughly in that order, I would argue, of functional unblinding, both in terms of magnitude, Zyprexa will make you hungry, and in the speed of onset of some of the adverse effects; stimulants and some of the second-generation and beyond antipsychotics have pretty noticeable side effects for many subjects, and relatively rapidly. So I think those are all important features to consider.

CRISTINA CUSIN: Dr. Schmidt?

LIANE SCHMIDT: I think using midazolam could give some sensory sensations, so the patients can say there is some immediate effect on the body. But this raises the question of whether the dissociations we observe in some patients during ketamine infusions play a role in the antidepressant response. It's still an open question, so I don't have the answer. And midazolam doesn't really induce dissociations. I don't know, maybe you can isolate the dissociative experience you get on ketamine. But patients might also be educated to expect dissociative experiences, and when they don't have them, the midazolam experience becomes something negative. So again, self-fulfilling prophecies might come into play.

CRISTINA CUSIN: I want to add something for five seconds, because I ran a large ketamine clinic. We know very little about the role of placebo in maintaining an antidepressant response, while the dissociation often wears off over time; it's completely separate from the antidepressant effect. We don't have long-term placebo studies. The studies are extremely short and examine the acute effect, so we don't know how to sustain or maintain response, or what the role of the placebo effect is in long-term treatment. That's another field that is really open to investigation. Dr. Rief.

WINFRIED RIEF: Following up on the issue of active placebos, I just want to mention that we did a study comparing active placebos to passive placebos and showing that active placebos are really more powerful. And I think the really disappointing part of this news is that it questions the blinding of our typical RCTs comparing antidepressants versus placebo, because many patients who are in the active, the drug, group perceive these onset effects, and this will further boost the placebo mechanisms in the drug group in a way that does not exist in the passive placebo group. This is a challenge that further questions the validity of our typical RCTs.

CRISTINA CUSIN: Marta.

MARTA PECINA: Just a quick follow-up to what Cristina was saying: we need to clarify whether we want to find an active control for the dissociative effects or for the antidepressant effects. The approach will be very different. And this applies to ketamine but also to psychedelics, because we're having this discussion there as well. These treatments are very complicated; they have multiple effects. So when thinking about how to control or how to blind, we just need to have the discussion of what we are trying to blind, because the mechanism of action of the blinding drug will be very different.

TED KAPTCHUK: Can I say something about blinding? Hróbjartsson is the author of the New England Journal of Medicine paper arguing that the placebo effect is a myth.

In 2022 he published in BMJ the largest, he called it a mega meta-analysis, on blinding. He took 144 randomized controlled trials that included nonblinded versus blinded assessments of the same drug. I'm not going to tell you the conclusion because it's unbelievable, but you should read it, because it really would influence what we think about blinding. That study was just recently replicated on a different set of patients undergoing procedures, in JAMA Surgery three months ago. Blinding, like placebo, is more complicated than we think. That's what I wanted to say.

MATTHEW RUDORFER: Another clinical factor that has come up during our discussion is the relationship of the patient to the provider; we saw data showing that a warm relationship seemed to enhance therapeutic response to, I believe, most interventions. I wonder what the panel thinks about, on the one hand, the rise of shortened clinical visits, now that antidepressants are mostly given by busy primary care physicians rather than specialists and the so-called med check is a really quick visit, and, especially since the pandemic, the rise of telehealth, where a person might never even meet their provider in person. Is it possible we're on our way to a clinical trial that could involve, say, mailing medication every week to a patient, having them do their weekly ratings online, and eliminating the provider altogether, just looking at the pharmacologic effect?

I mean, that probably isn't how we want to actually treat people clinically, but in terms of research, say, early phase efficacy, is there merit to that kind of approach?

LUANA COLLOCA: I'll comment on this, Dr. Rudorfer. We're very interested to see how the telemedicine or virtual reality can affect placebo effects, and we're modeling in the lab placebo effects induced via, you know, in person interaction.

We compared an in-person interaction with an avatar in virtual reality, and actually we found placebo effects in both settings. However, when we look at empathy, the avatar doesn't elicit any empathy in the relationship; we truly need the in-person connection to have empathy. That suggests that some outcomes are affected by having in-person versus telemedicine or remote interactions, and yet the placebo effects persist in both settings. Empathy is modulated differently, and interestingly, in our data, empathy mediated placebo effects only in the in-person interactions. So there is still value in telemedicine, with effects that seem to bypass empathy and competence.

MATTHEW RUDORFER: Dr. Hall.

KATHRYN HALL: Several of the large studies, like the Women's Health Study, Physicians' Health Study and, more recently, Vital, they did exactly that, where they mail these pill packs. And I mean, the population, obviously, is clinicians. So they are very well trained and well behaved. And they follow them for years but there's very little contact with the providers, and you still have these giant -- I don't know if you can call them placebo effects -- but certainly many of these trials have not proven to be more effective, the drugs they're studying, than placebo.

MATTHEW RUDORFER: Dr. Atlas.

LAUREN ATLAS: I wanted to chime in briefly on this important question. I think the data presented yesterday on first impressions of providers is relevant here, because it suggests that even when we use things like Zocdoc to select physicians and we have headshots, we're really making these decisions about who to see based on first impressions and facial features, and having the actual interaction with providers is critical for getting beyond that kind of factor, which may drive selection. So if we have situations with fewer chances to interact, first of all, people bring expectations to the table based on what they know about the provider, and then you don't really have the chance to build on that without the actual therapeutic alliance. That's why I think, even though our study was done in an artificial setting, it really does show how we make these choices when there are bios and photos of physicians available for patients to select from. There is a really important expectation being brought to the table before the treatment even occurs.

MATTHEW RUDORFER: Thanks. Dr. Lisanby.

SARAH “HOLLY” LISANBY: Thanks for raising this great question, Matt. I have a little bit of a different take on it. Equity in access to mental health care is a challenge, and the more we can leverage technology to provide and extend the reach of mental health care, the better. We've been thrust into the era of telemedicine and telepsychiatry by the pandemic, but it existed before the pandemic as well. And it's not just about telepsychotherapy or teleprescribing and remote monitoring of pharmacotherapy; digital remote neuromodulation is also a thing now. There are neuromodulation interventions that can be done at home that are being studied, and there have been trials of transcranial direct current stimulation at home with remote monitoring. There are challenges in those studies in differentiating between active and sham. But I think you're right that we may have to rethink how we control remote studies when the intensity of clinician contact is very different. Still, I do think we should explore these technologies so that we can extend the reach of, and access to, research and care for people who are not able to come into the research lab setting.

TED KAPTCHUK: May I add something on this? It's also criticizing myself. In 2008, I did this very nice study showing you could increase the doctor/patient relationship. And as you increase it, the placebo effect got bigger and bigger, like a dose response. A team in Korea that I worked with replicated that. I just published that replication.

The replication came out with the exact opposite results: the less doctor/patient relationship, the less intrusive, the less empathic the interaction, the better the effects. We're dealing with very complicated, culturally constructed issues, and I just want to put it out there that the sand is soft. I'm really glad that somebody contradicted a major study that I did.

LUANA COLLOCA: Exactly. The cultural context is so critical: what we observe in one context, in one country, even within the same in-group or out-group, can be completely different in Japan, China, or somewhere else, or in the Americas or South Africa. So we need larger studies and more cross-country collaborations.

MATTHEW RUDORFER: Dr. Schmidt.

LIANE SCHMIDT: I just wanted to raise a point, more of a comment: there is also very interesting research going on in the interactions between humans and robots, and usually humans treat robots very badly. Here we focus on very human traits, like empathy and competence. But when it comes to artificial intelligence, for example, when we have to interact with algorithms, all these social interactions might turn out completely differently and might have different effects on placebo effects. Just a thought.

MATTHEW RUDORFER: Dr. Rief.

WINFRIED RIEF: Yesterday, I argued for showing more warmth and competence, but I'll modify that a little bit today, because I think the real point became quite visible today, and that is that there is an interaction between these nonspecific placebo effects and the drug effect, in many cases at least. We don't know whether there are exceptions to this rule, but in many cases we have an interaction. And to learn about the interaction, we need study designs that modulate drug intake versus placebo intake, but also modulate the placebo mechanisms, the expectation mechanisms, the context of the treatment. Only if we have these 2-by-2 designs, modulating drug intake and modulating context and psychological factors, do we learn about the interaction. You cannot learn about the interaction if you modulate only one factor.

And therefore, as Luana and others have said, an interaction can be quite powerful and effective in one context but maybe even misleading in another. I think this is proven; we have to learn more about it. And all the studies that have been shown, from basic science to application, suggesting that there could be an interaction point in this direction and to the necessity of using more complex designs to learn about the interaction.

MATTHEW RUDORFER: Yes. And the rodent studies we've seen, I think, have a powerful message for us just in terms of being able to control a lot of variables that are just totally beyond our control in our usual human studies. It always seemed to me, for example, if you're doing just an antidepressant versus placebo trial in patients, well, for some people going into the clinic once a week to get ratings, that might be the only day of the week that they get up and take a shower, get dressed, have somebody ask them how they're doing, have some human interaction. And so showing up for your Hamilton rating could be a therapeutic intervention that, of course, we usually don't account for in the pharmacotherapy trial. And the number of variables really can escalate in a hurry when we look at our trials closely.

TED KAPTCHUK: Tor wants to say something.

TOR WAGER: Thanks, Ted.

I wanted to add on to the interaction issue, which came up yesterday, which Winfried and others just commented on, because it seems like it's really a crux issue. If the psychosocial or expectation effects and other things like that are entangled with specific effects so that one can influence the other and they might interact, then, yeah, we need more studies that independently manipulate specific drug or device effects and other kinds of psychological effects independently. And I wanted to bring this back up again because this is an idea that's been out here for a long time. I think the first review on this was in the '70s, like '76 or something, and it hasn't really been picked up for a couple of reasons. One, it's hard to do the studies. But second, when I talk to people who are in industry and pharma, they are very concerned about changing the study designs at all for FDA approval.

And since we had some, you know, FDA and regulatory perspectives here yesterday, I wanted to bring that up and see what people think, because I think that's been a big obstacle. And if it is, then that may be something that would be great for NIH to fund instead of pharma companies because then there's a whole space of drugs, psychological or neurostimulation psychological interactions, that can be explored.

MATTHEW RUDORFER: We also had a question. Yesterday there was discussion of sex differences in placebo response in a naloxone trial, and I wonder if there are any further thoughts on studies of sex differences, or diversity in general, in placebo trials. Yes.

LUANA COLLOCA: We definitely see sex differences in placebo effects, and I showed, for example, that women responded to arginine vasopressin in a way that we don't observe in men.

But you also asked about diversity. In a paper of ours just accepted today, we looked at where people live in the state of Maryland, and even the location where they are based makes a difference in placebo effects. People who live in the most distressed parts of the greater Baltimore area tended to have lower placebo effects compared to those in less distressed locations. We defined that with radius-based criteria; it is not simply race, because we take into account education, income, and so on. It is interesting because across studies we consistently see an impact of these aspects of diversity. And in that sense, I echo the comment that we need to find a way to reach these people and truly improve access and the opportunity for diversity. Thank you for asking.

MATTHEW RUDORFER: Thank you. Another issue that came up yesterday had to do with pharmacogenomics. There was a question, or a question/comment, about using candidate gene approaches and whether they are problematic.

KATHRYN HALL: What approaches?

MATTHEW RUDORFER: Candidate genes.

KATHRYN HALL: I think we have to start where we are. The psychiatric field has had a really tough time with genetics. They've invested a lot and, sadly, don't have as much to show for it as they would like to. And I think that has really tainted this quest for genetic markers of placebo and related studies, these interaction factors. But it's really important, I think, not to use that to stop us from looking forward and identifying what's there. Because when you start to scratch the surface, there are interactions. You can see them. They're replete in the literature. And what's really fascinating is that the people who find them often don't discuss them when they report their study. Even in some of these vasopressin studies (not yours, obviously, Tor), I was reading one the other day where they had seen tremendous differences by genetics in response to arginine vasopressin, and they totally ignored what they were seeing in placebo and talked about who responds to drug. So I think that not only do we need to start looking for what's happening, we need to start being more open-minded and paying attention to what we're seeing in the placebo arm, and accounting for that, taking it into account to understand what we're seeing across a trial in total.

CRISTINA CUSIN: I'll take a second to comment on subject selection and on trying to figure out, depending on the site, who the patients are who enter a depression treatment clinical trial. If we eliminate professional patients from the discussion, we are thinking about patients who are more desperate, patients who don't have access to care, patients who are more likely to have psychosocial stressors, or, at the other extreme, patients who are highly educated, who find the trials and seek them out. They're certainly not representative of the general populations we see in the clinical setting.

They are somewhat different. And then if you think about the psychedelics trials, they go from 5,000 patients applying for a study to the study ending up recruiting 20 or 30. So they are absolutely not representative of the general population we see, in terms of diversity, in terms of comorbidities, in terms of psychosocial situations. So that's another factor that adds to the complexity of differentiating what happens in the clinical setting versus an artificial setting like a research study. Tor.

MATTHEW RUDORFER: The question of who enters trials, and I think the larger issue of diagnosis in general, has really been a challenge to the field for many years. Ted and I go back a ways, and just looking at depression, which of course has dominated a lot of our discussion these last couple of days, with good reason: my understanding is that the good database of placebo-controlled trials goes back to the late '90s, which is what we heard yesterday. And if you go back further, the tricyclic era not only dealt with different medications, which we don't want to go back to, but think about practice patterns then. The tricyclics, which most nonspecialists steered clear of, required a lot of hands-on care. They required slow upward titration. They had some concerning toxicities, and so it was typical that psychiatrists would prescribe them but family docs would not. And that also had the effect of a naturalistic screening; that is, people would have to reach a certain level of severity before they were referred to a psychiatrist to get a prescription for medication.

More mildly ill people either wound up, probably inappropriately, on tranquilizers or on no treatment at all, and moderately to severely ill people wound up on tricyclics, and of course inpatient stays were common in those days, which again was another kind of screening. In the old days I heard people talk about how, if you went to the inpatient ward, you could easily collect people to be in a clinical trial, and you kind of knew that they were vetted already, that they had severe depression, and the general sense was that the placebo response would be low, though there's no real evidence for that. But once we had the SSRIs, the market vastly expanded because they're considered more broad spectrum. People with milder illness and anxiety disorders are now appropriate candidates, and the SSRIs are easier to dispense; the concern about overdose is much less, and so they're mostly prescribed by nonspecialists. So we've seen a lot of large clinical trials where it doesn't take much to reach the threshold for entry. If I go way back, and this is just one of my personal concerns over many years, the Feighner criteria, which I think were the first good set of diagnostic criteria based on data, based on the literature, were published in 1972, and to make a diagnosis of major depression they called for four weeks of symptoms. Actually, literally, I think it said one month.

DSM-III came out in 1980, and it called for two weeks of symptoms. I don't know -- I've not been able to find any documentation of how the one month went to two weeks, except that the DSM, of course, is the manual that's used in clinical practice. And you can understand, well, you might not want to have too high a bar to treat people who are seeking help. But I think one of the challenges of the DSM is that it was not meant as a research manual, though that's often how it's used. So ever since that time, those two weeks have gotten reified, and so my point is that it doesn't take much to reach diagnostic criteria for DSM, now DSM-5-TR, major depression. So if someone is doing a clinical trial of an antidepressant, it is tempting to enroll people who honestly meet those criteria, but the criteria are not very strict. So I wonder whether that contributes to the larger placebo effect that we see today.

End of soapbox. The question -- I'd like to revisit an excellent point that Dr. Lisanby raised yesterday, which has to do with the Research Domain Criteria, the RDoC criteria. I don't know if anyone on the panel has had experience using that in any trials and whether you see any merit there. Could RDoC criteria essentially enrich the usual DSM-type clinical criteria in terms of trying to more finely differentiate subtypes of depression that might respond differently to different treatments?

MODERATOR: I think Tor has been patient on the handoff. Maybe the next question, Tor; I'm not sure if you had comments on the previous discussion.

TOR WAGER: Sure, thanks. I wanted to make a comment on the candidate gene issue. And I think it links to what you were just saying as well, doctor, in a sense. I think it relates to the issue of predicting individual differences in placebo effects and using that to enhance clinical trials, which has been a really difficult issue. And in genetics, I think what's happened, as many of us know, is that there were many findings on particular candidate genes, especially COMT and other particular sets of genes, in Science and Nature, and none of those really replicated when larger GWA studies started being done. And the field of genetics really focused in on reproducibility and replicability and on much larger sample sizes. My genetics colleagues tell me something like 5,000 is a minimum for even making it into their database of genetic associations. And so that makes it really difficult to study placebo effects in sample sizes like that. And at the same time, there's been this trend in psychology, and in science in general, towards reproducibility and replicability, probably in part provoked by John Ioannidis's provocative claims that most findings are false, but there's something really there.

There have been many teams of people who have tried to pull together, like Brian Nosek's work with the Open Science Framework and the Many Labs studies, to replicate effects in psychology with much higher power. So there's this increasing effort to pull together consortia to really test these things rigorously. And I wonder -- we might not have a GWA study of placebo effects in 100,000 people or something, which is what would convince a geneticist that there's some kind of association. I'm wondering what the ways forward are, and I think one way is to increasingly come together to pool studies, or do larger studies that are preregistered, and even registered reports, which are reviewed before they're published, so that we can test some of these associations that have emerged in what we might call the early studies of placebo effects.

And I think if we preregistered and found something in sufficiently large and diverse samples, that might make a dent in convincing the wider world that there is something we can use going forward in clinical trials, and that pharma might be interested in as well. That's my take on that, and I'm wondering what people think.

KATHRYN HALL: My two cents: I completely agree with you. I think the way forward is to pool our resources to look at this and not simply stop. When things don't replicate, we need to understand why they don't replicate. There's a taboo on looking beyond: if you prespecified something and you don't see it, then it should be over. But at least at this early stage, when we're trying to understand what's happening, I think we need to allow ourselves deeper dives, not for action but for understanding.

So I agree with you. Let's pool our resources and start looking at this. The other thing I would like to point out that's interesting is that when we've looked at the placebo arm of some of these clinical trials, we actually learn a lot about natural history. We just did one in Alzheimer's disease, and in the placebo arm the genome-wide significant hit was CETP, which is now a clinical target in Alzheimer's disease. You can learn a lot by looking at the placebo arms of these studies, not just about whether or how the drug is working, but about what's happening in the natural history of these patients that might change the effect of the drug.

TED KAPTCHUK: Marta, did you have something to say; you had your hand up.

MARTA PECINA: Just a follow-up to what everybody is saying. I do think the issue of individual variability is important. One thing that maybe explains some of what was also said at the beginning, that there's a bit of a lack of consistency, or of a way to put all of these findings together, is the fact that we think about it as one single placebo effect, when we know that there's not one single placebo effect, even across differing clinical conditions. Is the neural placebo effect the same in depression as it is in pain?

Or are there aspects that are the same, for example expectancy processing, but other things that are very specific to the clinical condition, whether it's pain processing, mood or something else? So I think we face the reality that, from a neurobiology perspective, a lot of the research has been done in pain, and there's still very little being done, at least in psychiatry, across many other clinical conditions, so we just don't know. We don't even really know how the placebo effect looks when you have both pain and depression, for example.

And so those are still very open questions that kind of reflect our state, right, that we're making progress but there's a lot to do.

TED KAPTCHUK: Winfried, did you want to say something? You have your hand up.

WINFRIED RIEF: I wanted to come back to the question of whether we really understand this increase in placebo effects. I don't know whether you have (indiscernible) for that. But as a scientist, I can't believe that people nowadays react more to placebos than they did 20 years ago. So there might be other explanations for this effect, like changes in trial designs. We have more control visits nowadays, maybe, compared to 30 years ago, but there could also be other factors, like publication bias, which was maybe more frequent 30 years ago than it is nowadays with the need for trial registration. So there are a lot of methodological issues that could explain this increase in placebo effects, or in responses in the placebo groups. I would be interested in whether you think this increase is well explained, or what your explanations for it are.

TED KAPTCHUK: Winfried, I want to give my opinion. I did think about this issue. I remember the first time it was reported, by scientists in Cleveland with 40, 50 patients, and I said, oh my God, okay, and the newspapers had it all over: the placebo effect is increasing. There's this boogeyman around, and everyone started believing it. I've been collecting the papers, and I've consistently found as many papers saying there's no change over time as papers saying there are changes over time. When I read the original article, I said, of course there are differences. The patients who got recruited in 1980 were different from the patients in 1990 or 2010. They were either more chronic or less chronic.

They were recruited in different ways, and that's really an easy explanation of why things change. Natural history changes. People's health problems are different, and I actually think that Stone's meta-analysis with 70,033 patients says it very clearly: it's a flat line from 1979. And the more data you have, the more you have to believe it. That's all. That's my personal opinion. And I think we actually are very deeply influenced by the media. I mean, I can't believe this:

"The mystery of the placebo." We know more about placebo effects, at least compared to many drugs on the market. That's my opinion. Thanks, Winfried, for letting me say it.

MATTHEW RUDORFER: Thanks, Ted.

We have a question for Greg. The question is, I wonder what the magic of 90 seconds is? Is there a physiologic basis to the turning point when the mouse changes behavior?

GREGORY CORDER: I think I addressed it in a written post somewhere. We don't know. We see a lot of variability in those animals. In this putative placebo phase, some mice will remain on the conditioned side for 40 seconds, 45 seconds, 60 seconds, or they'll stay there the entire three minutes of the test. We're not exactly sure what's driving the difference between those animals. These are both males and females; we see the effect in both male and female C57BL/6 mice, a genetically inbred strain. We always try to restrict the time of day of testing, and we do reverse-light testing, so this is during the animals' wake cycle.

And there are things like dominance hierarchies within the cages, alphas versus betas. They may have different pain thresholds. But as for the breaking of whatever the anti-nociceptive effect is: they're standing on a hot plate for quite a long time. At some point those nociceptors in the periphery are going to become sensitized and signal. And at some point it's to the animal's advantage to pay attention to pain. You don't necessarily want to go around not paying attention to something that's potentially very dangerous or harmful to you. We would have to scale up the number of animals substantially, I think, to really start to parse out what difference would account for that. But that's an excellent point, though.

MATTHEW RUDORFER: Carolyn.

CAROLYN RODRIGUEZ: I want to thank all of today's speakers for their wonderful presentations. I just wanted to go back for a second to Dr. Pecina's point about the placebo effect not being a monolith and about thinking about individual disorders.

I'm a clinical trialist and do research in obsessive-compulsive disorder, and a lot of what is written in the literature and in meta-analyses is that OCD has one of the lowest placebo response rates. And so, from what we gathered today, I guess to turn the question on its head: why is that? Is that really the case, and if so, why? Does it say something about OCD pathology, and what? How can we really get more refined in terms of different domains and in really thinking about the placebo effect?

So I just want to say thank you again for giving us a lot of food for thought.

MATTHEW RUDORFER: Thanks. As we're winding down, one of the looming questions on the table remains what are research gaps and where do you think the next set of studies should go. And I think if anyone wants to put some ideas on the table, they'd be welcome.

MICHAEL DETKE: One of the areas that I mentioned in my talk that is hard for industry to study, or where there's not much incentive, is having third-party reviewers review source documents and videos or audios of the HAM-D, MADRS, whatever; there's not much controlled evidence on that.

And, you know, it's a fairly simple design: within a large controlled trial, do this with half the sites and don't do it with the other half.

Blinding isn't perfect. I haven't thought this through, and it could probably be improved upon a lot, but imagine you're the sponsor who's paying the $20 million over three years to run this clinical trial. You want to test your drug as fast as you possibly can. You don't really want to be paying for this methodology.

So that might be one. Earlier on, Tor or someone mentioned there might be some specific areas where this might be something for NIH to consider picking up. Because that methodology, the third-party remote reviewer, is being used in hundreds of trials today, I think. So there's an area to think about.

MATTHEW RUDORFER: Thanks. Holly.

SARAH “HOLLY” LISANBY: Yeah. Carolyn just mentioned one of the gap areas: really trying to understand why some disorders are more amenable to the placebo response than others and what that can teach us. That sounds like a research gap area to me.

Also, throughout these two days we've heard a number of research gap areas having to do with methodology: how to do placebos or shams, how to assess outcome, how to protect the blind, how to select what your outcome measures should be.

And then also today my mind was going very much towards what preclinical models can teach us, and towards the genetics and the biology underlying individual differences in placebo response.

There may be clues there. Carolyn, to your point about placebo response being lower in OCD: there are still some OCD patients who respond, so what's different about them that makes them responders?

And so studies that look, within the placebo arm, at response versus nonresponse, or gradations of response, or durability of response, and the mechanisms behind that.

These are questions that I think may ultimately facilitate getting drugs and devices to market, but certainly are questions that might be helpful to answer at the research stage, particularly at the translational research stage, in order to inform the design of pivotal trials that you would ultimately do to get things to market.

So it seems like there are many stages before getting to the ideal pivotal trial. So I really appreciate everyone's input. Let me stop talking because I really want to hear what Dr. Hall has to say.

KATHRYN HALL: I wanted to come back to one of my favorite gaps, this question of the increasing placebo effect. I think it's an important one because so many trials are failing these days. And not all trials are the same.

And what's really fascinating to me is that you see really great results in Phase II clinical trials, and then what's the first thing you do as a pharma company when you get a good result? You put out a press release.

And what's the first thing you're going to do when you enroll in a clinical trial? You're going to read a press release. You're going to read as much as you can about the drug or the trial you're enrolling in. And how placebo-boosting is it going to be to see that this trial had amazing effects on the condition you're struggling with?

Then, lo and behold, we go to Phase III, and, well, we're actually writing a paper on this: how many times we see the words "unexpected results," and I think we saw them here today or yesterday. This should not be unexpected. When your Phase III trial fails, you should not be surprised, because this is what's happening time and time again.

And, yeah, I agree, Ted, these are modern times, but there's so much information out there, so much information to sway us towards placebo responses, that I think that's a piece of the problem. And finding out what the problem is, I think, is a really critical gap.

MATTHEW RUDORFER: Winfried.

WINFRIED RIEF: Yeah. May I follow up, since I think it fits quite nicely with what has been said before, and I want to answer directly to Michael Detke.

At first glance, it seems less expensive to do the trials the way we do them, with one placebo group and one drug arm, where we try to keep the context constant. But this is the problem. We have a constant context without any variation, so we don't learn under which context conditions this drug is really effective and under which context conditions the drug might not be effective at all.

And therefore I think the current strategy is more like a lottery. It can happen, really by chance, that you are in the little window where the drug can show its most positive effectiveness, but it can also be that you are in the little window, or the big window, where the drug is not able to show its effectiveness.

And therefore I think, on second glance, it's a very expensive strategy to use only one single context to evaluate a drug.

MATTHEW RUDORFER: If I have time for--

TED KAPTCHUK: Let Marta speak, and then Liane should speak.

MARTA PECINA: I just wanted to add a minor comment here, which is that we're going to have to move on from the idea that giving someone a placebo is enough to induce positive expectancies, and recognize the fact that expectancies evolve over time.

At least in some of the data that we've shown, and it's a small sample, we still see that 50% of the subjects who are given a placebo don't have drug assignment beliefs. And so that is a very large amount of variability that we are confusing with everything else.

And so I do think that it is really important, whether in clinical trials or in research, to really develop new ways of measuring expectancies and to allow expectancies to be measured over time. Because they do change: we have some prior expectancies, and then we have some expectancies that are learned based on experience. And I do think that this is an area the field could improve relatively easily, you know, assess expectancies better, measure expectancies better.

TED KAPTCHUK: Liane, why don't you say something, and Luana, and then Cristina.

LIANE SCHMIDT: So maybe one more open gap is about cognition: how can studying placebo help us better understand human reasoning, and vice versa? All the biases we have, cognitive processes like motivation, for example, or memory, and also what we know about optimism biases: how do they contribute to placebo effects on the patient side, but also on the clinician side, when clinicians have to make a diagnosis or judge treatment efficacy based on some clinical scale?

So, basically, using tools from cognition, from psychology or cognitive neuroscience, to better understand the cognitive processes that intervene when we have an expectation and a behavioral readout, a symptom or a neural activation: what comes in between, and how it is translated, basically, in terms of cognitive predictability.

LUANA COLLOCA: I think we have tended to consider expectation as a static measurement, when in reality we know that what we expected at the beginning of this workshop is slightly different by the end, given what we are hearing and, you know, learning.

So expectation is a dynamic phenomenon, and the assumption that we can predict placebo effects with our measurement of expectation can be very limiting in terms of, you know, applications. Rather, it is important to measure expectation over time and also to realize that there are so many nuances of expectation, like Liane just mentioned.

There are people who say, "I don't expect anything, I have tried everything," or people who say, "Oh, I truly want to feel better." And these are also problematic patients, because having an unrealistic expectation can often destroy placebo effects, as I showed, through a violation of expectancies.

TED KAPTCHUK: Are we getting close? Do you want to summarize? Or who's supposed to do that? I don't know.

CRISTINA CUSIN: I think I have a couple of minutes for remarks. There's so much going on, and more questions than answers, of course.

This has been a fantastic symposium, and I was trying to pitch an idea about possibly organizing a summit with all the panelists, all the presenters, and everyone else who wants to join us, because I think that with a coffee or a tea in our hands, and talking not through a Zoom video, we could actually come up with some great ideas and some collaboration projects.

Anyone who wants to email us, we'll be happy to answer. And we're always open to collaborating and starting a new study, bouncing new ideas off each other. This is what we do for a living. So we're very enthusiastic about people asking difficult questions.

And some of the questions that are ongoing, and that I think point to future areas, are what we were talking about a few minutes ago. We don't know if a placebo responder in a migraine study, for example, would be a placebo responder in a depression study or an IBS study. We don't know if this person is going to be a universal placebo responder, or whether the context includes the type of disease they're suffering from, so that it's going to be fairly different; and why some disorders have a lower placebo response rate overall compared to others. Is that chronicity? Does a relapsing-remitting disorder have a higher chance of a placebo response because the system can be modulated, versus a disorder that is considered more chronic and stable? A lot of this information about the natural history is not known.

It also comes to mind that the exact entry criteria matter, because we almost never have a threshold for the number of prior episodes of depression to enter a trial, or for how chronic it has been, or years of depression, or other factors that can clearly change the probability of responding to a treatment.

We heard about methodology for clinical trial design and how patients could be responsive to placebo or sham, or responsive to drug. How about patients who could respond to both? We have no idea how many of those patients, universal responders, are undergoing a trial, unless we do a crossover. And we know that crossover is not a popular design for drug trials.

So we also need to figure out aspects of methodology: how to assess outcome, what the best way is to assess the outcome that we want, whether it is clinically relevant, how to protect the blind, and how to assess expectations and how they change over time.

We didn't hear much during the discussion about the role of mindfulness in pain management, and I would like to hear much more about how we're doing in identifying the areas involved and whether we can actually intervene on those areas with devices to help with pain management. That's one of the biggest problems we have in terms of clinical care.

On the eating disorder side, there is the creation of computational models to influence food choices. And, again, with devices or treatments specifically shifting the balance towards making healthier food choices, I can see an entire field developing, because most of the medications we prescribe for psychiatric disorders affect food choices and cause weight gain, potentially leading to obesity and cardiovascular complications. So there's an entire field of research we have not touched on.

And on the role of animal models in translational research: I don't know if animal researchers, like Greg, talk much with clinical trialists. I think that would be a much-needed cross-fertilization, and we can definitely learn from each other.

It has just been fantastic. I thank all the panelists for their willingness to work with us, for their time and dedication, and for the many meetings to discuss and agree on the program and to divide and conquer the different topics. It has been a phenomenal experience, and I'm very, very grateful.

And the NIMH staff has also been amazing; it was a pleasure to collaborate with all of them, and they were so organized. Just a fantastic panel. Thank you, everybody.

MATTHEW RUDORFER: Thank you.

TOR WAGER: Thank you.

NIMH TEAM: Thanks from the NIMH team to all of our participants here.

(Meeting adjourned)


What is the advantage of eliminating bias in an experiment?

You will get an accurate answer from actually performing the experiment, instead of hypothesizing an answer from what you have heard.

What is the word for a personal opinion that may affect experiments?

Bias. If a person lets their bias into a scientific experiment, the results will likely be skewed.

How do scientists prevent bias?

Perform several independent experiments on the same topic.

Personal opinion that may affect experiments?

Really, it's called bias.

Is experimenter bias included in the scientific process?

People who perform experiments take some care to avoid introducing their personal bias into the results. But even if there is a bias, the same experiment may be done by other people who have other biases or who are more successful in working in an unbiased manner. Eventually, truth will emerge.



Open access | Published: 28 August 2024

AI generates covertly racist decisions about people based on their dialect

Valentin Hofmann (ORCID: 0000-0001-6603-3428), Pratyusha Ria Kalluri, Dan Jurafsky (ORCID: 0000-0002-6459-7745) & Sharese King

Nature (2024)


Hundreds of millions of people now interact with language models, with uses ranging from help with writing 1 , 2 to informing hiring decisions 3 . However, these language models are known to perpetuate systematic racial prejudices, making their judgements biased in problematic ways about groups such as African Americans 4 , 5 , 6 , 7 . Although previous research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time, particularly in the United States after the civil rights movement 8 , 9 . It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice, exhibiting raciolinguistic stereotypes about speakers of African American English (AAE) that are more negative than any human stereotypes about African Americans ever experimentally recorded. By contrast, the language models’ overt stereotypes about African Americans are more positive. Dialect prejudice has the potential for harmful consequences: language models are more likely to suggest that speakers of AAE be assigned less-prestigious jobs, be convicted of crimes and be sentenced to death. Finally, we show that current practices of alleviating racial bias in language models, such as human preference alignment, exacerbate the discrepancy between covert and overt stereotypes, by superficially obscuring the racism that language models maintain on a deeper level. Our findings have far-reaching implications for the fair and safe use of language technology.

Language models are a type of artificial intelligence (AI) that has been trained to process and generate text. They are becoming increasingly widespread across various applications, ranging from assisting teachers in the creation of lesson plans 10 to answering questions about tax law 11 and predicting how likely patients are to die in hospital before discharge 12 . As the stakes of the decisions entrusted to language models rise, so does the concern that they mirror or even amplify human biases encoded in the data they were trained on, thereby perpetuating discrimination against racialized, gendered and other minoritized social groups 4 , 5 , 6 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 .

Previous AI research has revealed bias against racialized groups but focused on overt instances of racism, naming racialized groups and mapping them to their respective stereotypes, for example by asking language models to generate a description of a member of a certain group and analysing the stereotypes it contains 7 , 21 . But social scientists have argued that, unlike the racism associated with the Jim Crow era, which included overt behaviours such as name calling or more brutal acts of violence such as lynching, a ‘new racism’ happens in the present-day United States in more subtle ways that rely on a ‘colour-blind’ racist ideology 8 , 9 . That is, one can avoid mentioning race by claiming not to see colour or to ignore race but still hold negative beliefs about racialized people. Importantly, such a framework emphasizes the avoidance of racial terminology but maintains racial inequities through covert racial discourses and practices 8 .

Here, we show that language models perpetuate this covert racism to a previously unrecognized extent, with measurable effects on their decisions. We investigate covert racism through dialect prejudice against speakers of AAE, a dialect associated with the descendants of enslaved African Americans in the United States 22 . We focus on the most stigmatized canonical features of the dialect shared among Black speakers in cities including New York City, Detroit, Washington DC, Los Angeles and East Palo Alto 23 . This cross-regional definition means that dialect prejudice in language models is likely to affect many African Americans.

Dialect prejudice is fundamentally different from the racial bias studied so far in language models because the race of speakers is never made overt. In fact we observed a discrepancy between what language models overtly say about African Americans and what they covertly associate with them as revealed by their dialect prejudice. This discrepancy is particularly pronounced for language models trained with human feedback (HF), such as GPT4: our results indicate that HF training obscures the racism on the surface, but the racial stereotypes remain unaffected on a deeper level. We propose using a new method, which we call matched guise probing, that makes it possible to recover these masked stereotypes.

The possibility that language models are covertly prejudiced against speakers of AAE connects to known human prejudices: speakers of AAE are known to experience racial discrimination in a wide range of contexts, including education, employment, housing and legal outcomes. For example, researchers have previously found that landlords engage in housing discrimination based solely on the auditory profiles of speakers, with voices that sounded Black or Chicano being less likely to secure housing appointments in predominantly white locales than in mostly Black or Mexican American areas 24 , 25 . Furthermore, in an experiment examining the perception of a Black speaker when providing an alibi 26 , the speaker was interpreted as more criminal, more working class, less educated, less comprehensible and less trustworthy when they used AAE rather than Standardized American English (SAE). Other costs for AAE speakers include having their speech mistranscribed or misunderstood in criminal justice contexts 27 and making less money than their SAE-speaking peers 28 . These harms connect to themes in broader racial ideology about African Americans and stereotypes about their intelligence, competence and propensity to commit crimes 29 , 30 , 31 , 32 , 33 , 34 , 35 . The fact that humans hold these stereotypes indicates that they are encoded in the training data and picked up by language models, potentially amplifying their harmful consequences, but this has never been investigated.

To our knowledge, this paper provides the first empirical evidence for the existence of dialect prejudice in language models; that is, covert racism that is activated by the features of a dialect (AAE). Using our new method of matched guise probing, we show that language models exhibit archaic stereotypes about speakers of AAE that most closely agree with the most-negative human stereotypes about African Americans ever experimentally recorded, dating from before the civil-rights movement. Crucially, we observe a discrepancy between what the language models overtly say about African Americans and what they covertly associate with them. Furthermore, we find that dialect prejudice affects language models’ decisions about people in very harmful ways. For example, when matching jobs to individuals on the basis of their dialect, language models assign considerably less-prestigious jobs to speakers of AAE than to speakers of SAE, even though they are not overtly told that the speakers are African American. Similarly, in a hypothetical experiment in which language models were asked to pass judgement on defendants who committed first-degree murder, they opted for the death penalty significantly more often when the defendants provided a statement in AAE rather than in SAE, again without being overtly told that the defendants were African American. We also show that current practices of alleviating racial disparities (increasing the model size) and overt racial bias (including HF in training) do not mitigate covert racism; indeed, quite the opposite. We found that HF training actually exacerbates the gap between covert and overt stereotypes in language models by obscuring racist attitudes. Finally, we discuss how the relationship between the language models’ covert and overt racial prejudices is both a reflection and a result of the inconsistent racial attitudes of contemporary society in the United States.

Probing AI dialect prejudice

To explore how dialect choice impacts the predictions that language models make about speakers in the absence of other cues about their racial identity, we took inspiration from the ‘matched guise’ technique used in sociolinguistics, in which subjects listen to recordings of speakers of two languages or dialects and make judgements about various traits of those speakers 36 , 37 . Applying the matched guise technique to the AAE–SAE contrast, researchers have shown that people identify speakers of AAE as Black with above-chance accuracy 24 , 26 , 38 and attach racial stereotypes to them, even without prior knowledge of their race 39 , 40 , 41 , 42 , 43 . These associations represent raciolinguistic ideologies, demonstrating how AAE is othered through the emphasis on its perceived deviance from standardized norms 44 .

Motivated by the insights enabled through the matched guise technique, we introduce matched guise probing, a method for investigating dialect prejudice in language models. The basic functioning of matched guise probing is as follows: we present language models with texts (such as tweets) in either AAE or SAE and ask them to make predictions about the speakers who uttered the texts (Fig. 1 and Methods ). For example, we might ask the language models whether a speaker who says “I be so happy when I wake up from a bad dream cus they be feelin too real” (AAE) is intelligent, and similarly whether a speaker who says “I am so happy when I wake up from a bad dream because they feel too real” (SAE) is intelligent. Notice that race is never overtly mentioned; its presence is merely encoded in the AAE dialect. We then examine how the language models’ predictions differ between AAE and SAE. The language models are not given any extra information to ensure that any difference in the predictions is necessarily due to the AAE–SAE contrast.
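
To make the basic procedure concrete, the following is a minimal sketch, in Python, of how matched guise probing could be wired up. The prompt wording, the adjective list and the score_completion helper are illustrative assumptions rather than the authors' exact implementation; score_completion stands in for any function that returns a language model's log probability of a continuation given a prompt.

# Illustrative sketch of matched guise probing (not the authors' exact code).
# `score_completion(prompt, continuation)` is a hypothetical stand-in for any
# function returning a language model's log probability of the continuation.

AAE_TEXT = "I be so happy when I wake up from a bad dream cus they be feelin too real"
SAE_TEXT = "I am so happy when I wake up from a bad dream because they feel too real"

PROMPT = 'A person who says "{text}" is'
ADJECTIVES = ["intelligent", "lazy", "musical", "aggressive"]  # Princeton Trilogy-style traits


def association_scores(text, score_completion):
    # Adjective -> log probability that the model assigns to that trait for this speaker.
    prompt = PROMPT.format(text=text)
    return {adj: score_completion(prompt, " " + adj) for adj in ADJECTIVES}


def dialect_contrast(score_completion):
    # Positive values mean the trait is associated more with the AAE guise than the SAE guise.
    aae = association_scores(AAE_TEXT, score_completion)
    sae = association_scores(SAE_TEXT, score_completion)
    return {adj: aae[adj] - sae[adj] for adj in ADJECTIVES}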

Figure 1: a, We used texts in SAE (green) and AAE (blue). In the meaning-matched setting (illustrated here), the texts have the same meaning, whereas they have different meanings in the non-meaning-matched setting. b, We embedded the SAE and AAE texts in prompts that asked for properties of the speakers who uttered the texts. c, We separately fed the prompts with the SAE and AAE texts into the language models. d, We retrieved and compared the predictions for the SAE and AAE inputs, here illustrated by five adjectives from the Princeton Trilogy. See Methods for more details.

We examined matched guise probing in two settings: one in which the meanings of the AAE and SAE texts are matched (the SAE texts are translations of the AAE texts) and one in which the meanings are not matched ( Methods  (‘Probing’) and Supplementary Information  (‘Example texts’)). Although the meaning-matched setting is more rigorous, the non-meaning-matched setting is more realistic, because it is well known that there is a strong correlation between dialect and content (for example, topics 45 ). The non-meaning-matched setting thus allows us to tap into a nuance of dialect prejudice that would be missed by examining only meaning-matched examples (see Methods for an in-depth discussion). Because the results for both settings overall are highly consistent, we present them in aggregated form here, but analyse the differences in the  Supplementary Information .

We examined GPT2 (ref. 46 ), RoBERTa 47 , T5 (ref. 48 ), GPT3.5 (ref. 49 ) and GPT4 (ref. 50 ), each in one or more model versions, amounting to a total of 12 examined models ( Methods and Supplementary Information (‘Language models’)). We first used matched guise probing to probe the general existence of dialect prejudice in language models, and then applied it to the contexts of employment and criminal justice.

Covert stereotypes in language models

We started by investigating whether the attitudes that language models exhibit about speakers of AAE reflect human stereotypes about African Americans. To do so, we replicated the experimental set-up of the Princeton Trilogy 29 , 30 , 31 , 34 , a series of studies investigating the racial stereotypes held by Americans, with the difference that instead of overtly mentioning race to the language models, we used matched guise probing based on AAE and SAE texts ( Methods ).

Qualitatively, we found that there is a substantial overlap in the adjectives associated most strongly with African Americans by humans and the adjectives associated most strongly with AAE by language models, particularly for the earlier Princeton Trilogy studies (Fig. 2a ). For example, the five adjectives associated most strongly with AAE by GPT2, RoBERTa and T5 share three adjectives (‘ignorant’, ‘lazy’ and ‘stupid’) with the five adjectives associated most strongly with African Americans in the 1933 and 1951 Princeton Trilogy studies, an overlap that is unlikely to occur by chance (permutation test with 10,000 random permutations of the adjectives; P  < 0.01). Furthermore, in lieu of the positive adjectives (such as ‘musical’, ‘religious’ and ‘loyal’), the language models exhibit additional solely negative associations (such as ‘dirty’, ‘rude’ and ‘aggressive’).
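
The reported overlap can be checked with a simple permutation test of the kind described above. The sketch below is illustrative only (the adjective lists and parameters are placeholders, not the authors' code): it estimates how often a random ordering of the full adjective list yields a top-five overlap at least as large as the observed one.

# Sketch of a permutation test for top-5 adjective overlap.
import random

def permutation_p_value(all_adjectives, human_top5, observed_overlap,
                        n_perm=10_000, k=5, seed=0):
    rng = random.Random(seed)
    pool = list(all_adjectives)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        overlap = len(set(pool[:k]) & set(human_top5))
        if overlap >= observed_overlap:
            at_least_as_large += 1
    return at_least_as_large / n_perm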

Figure 2: a, Strongest stereotypes about African Americans in humans in different years, strongest overt stereotypes about African Americans in language models, and strongest covert stereotypes about speakers of AAE in language models. Colour coding as positive (green) and negative (red) is based on ref. 34. Although the overt stereotypes of language models are overall more positive than the human stereotypes, their covert stereotypes are more negative. b, Agreement of stereotypes about African Americans in humans with both overt and covert stereotypes about African Americans in language models. The black dotted line shows chance agreement using a random bootstrap. Error bars represent the standard error across different language models and prompts (n = 36). The language models' overt stereotypes agree most strongly with current human stereotypes, which are the most positive experimentally recorded ones, but their covert stereotypes agree most strongly with human stereotypes from the 1930s, which are the most negative experimentally recorded ones. c, Stereotype strength for individual linguistic features of AAE. Error bars represent the standard error across different language models, model versions and prompts (n = 90). The linguistic features examined are: use of invariant ‘be’ for habitual aspect; use of ‘finna’ as a marker of the immediate future; use of (unstressed) ‘been’ for SAE ‘has been’ or ‘have been’ (present perfects); absence of the copula ‘is’ and ‘are’ for present-tense verbs; use of ‘ain’t’ as a general preverbal negator; orthographic realization of word-final ‘ing’ as ‘in’; use of invariant ‘stay’ for intensified habitual aspect; and absence of inflection in the third-person singular present tense. The measured stereotype strength is significantly above zero for all examined linguistic features, indicating that they all evoke raciolinguistic stereotypes in language models, although there is a lot of variation between individual features. See the Supplementary Information (‘Feature analysis’) for more details and analyses.

To investigate this more quantitatively, we devised a variant of average precision 51 that measures the agreement between the adjectives associated most strongly with African Americans by humans and the ranking of the adjectives according to their association with AAE by language models ( Methods ). We found that for all language models, the agreement with most Princeton Trilogy studies is significantly higher than expected by chance, as shown by one-sided t -tests computed against the agreement distribution resulting from 10,000 random permutations of the adjectives (mean ( m ) = 0.162, standard deviation ( s ) = 0.106; Extended Data Table 1 ); and that the agreement is particularly pronounced for the stereotypes reported in 1933 and falls for each study after that, almost reaching the level of chance agreement for 2012 (Fig. 2b ). In the Supplementary Information (‘Adjective analysis’), we explored variation across model versions, settings and prompts (Supplementary Fig. 2 and Supplementary Table 4 ).
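
The paper's exact agreement measure is a variant of average precision described in the Methods; as a rough illustration of the underlying idea, the sketch below computes standard average precision, treating the adjectives most strongly associated with African Americans by humans as the relevant set and scoring how early they appear in a model's AAE-association ranking.

# Standard average precision over a ranked adjective list (illustration only).
def average_precision(model_ranking, human_adjectives):
    relevant = set(human_adjectives)
    hits, precision_sum = 0, 0.0
    for rank, adjective in enumerate(model_ranking, start=1):
        if adjective in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0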

To explain the observed temporal trend, we measured the average favourability of the top five adjectives for all Princeton Trilogy studies and language models, drawing from crowd-sourced ratings for the Princeton Trilogy adjectives on a scale between −2 (very negative) and 2 (very positive; see Methods , ‘Covert-stereotype analysis’). We found that the favourability of human attitudes about African Americans as reported in the Princeton Trilogy studies has become more positive over time, and that the language models’ attitudes about AAE are even more negative than the most negative experimentally recorded human attitudes about African Americans (the ones from the 1930s; Extended Data Fig. 1 ). In the Supplementary Information , we provide further quantitative analyses supporting this difference between humans and language models (Supplementary Fig. 7 ).
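
The favourability measure itself is straightforward to compute; a minimal sketch is shown below, using invented ratings rather than the actual crowd-sourced values.

# Average favourability of the top-five adjectives on the -2 (very negative)
# to +2 (very positive) scale. The ratings here are invented placeholders,
# not the crowd-sourced data used in the paper.
EXAMPLE_RATINGS = {"musical": 1.3, "religious": 0.8, "loyal": 1.5,
                   "ignorant": -1.7, "lazy": -1.6, "stupid": -1.8}

def mean_favourability(top_adjectives, ratings):
    rated = [ratings[adj] for adj in top_adjectives if adj in ratings]
    return sum(rated) / len(rated) if rated else float("nan")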

Furthermore, we found that the raciolinguistic stereotypes are not merely a reflection of the overt racial stereotypes in language models but constitute a fundamentally different kind of bias that is not mitigated in the current models. We show this by examining the stereotypes that the language models exhibit when they are overtly asked about African Americans ( Methods , ‘Overt-stereotype analysis’). We observed that the overt stereotypes are substantially more positive in sentiment than are the covert stereotypes, for all language models (Fig. 2a and Extended Data Fig. 1 ). Strikingly, for RoBERTa, T5, GPT3.5 and GPT4, although their covert stereotypes about speakers of AAE are more negative than the most negative experimentally recorded human stereotypes, their overt stereotypes about African Americans are more positive than the most positive experimentally recorded human stereotypes. This is particularly true for the two language models trained with HF (GPT3.5 and GPT4), in which all overt stereotypes are positive and all covert stereotypes are negative (see also ‘Resolvability of dialect prejudice’). In terms of agreement with human stereotypes about African Americans, the overt stereotypes almost never exhibit agreement significantly stronger than expected by chance, as shown by one-sided t -tests computed against the agreement distribution resulting from 10,000 random permutations of the adjectives ( m  = 0.162, s  = 0.106; Extended Data Table 2 ). Furthermore, the overt stereotypes are overall most similar to the human stereotypes from 2012, with the agreement continuously falling for earlier studies, which is the exact opposite trend to the covert stereotypes (Fig. 2b ).

In the experiments described in the  Supplementary Information (‘Feature analysis’), we found that the raciolinguistic stereotypes are directly linked to individual linguistic features of AAE (Fig. 2c and Supplementary Table 14 ), and that a higher density of such linguistic features results in stronger stereotypical associations (Supplementary Fig. 11 and Supplementary Table 13 ). Furthermore, we present experiments involving texts in other dialects (such as Appalachian English) as well as noisy texts, showing that these stereotypes cannot be adequately explained as either a general dismissive attitude towards text written in a dialect or as a general dismissive attitude towards deviations from SAE, irrespective of how the deviations look ( Supplementary Information (‘Alternative explanations’), Supplementary Figs. 12 and 13 and Supplementary Tables 15 and 16 ). Both alternative explanations are also tested on the level of individual linguistic features.

Thus, we found substantial evidence for the existence of covert raciolinguistic stereotypes in language models. Our experiments show that these stereotypes are similar to the archaic human stereotypes about African Americans that existed before the civil rights movement, are even more negative than the most negative experimentally recorded human stereotypes about African Americans, and are both qualitatively and quantitatively different from the previously reported overt racial stereotypes in language models, indicating that they are a fundamentally different kind of bias. Finally, our analyses demonstrate that the detected stereotypes are inherently linked to AAE and its linguistic features.

Impact of covert racism on AI decisions

To determine what harmful consequences the covert stereotypes have in the real world, we focused on two areas in which racial stereotypes about speakers of AAE and African Americans have been repeatedly shown to bias human decisions: employment and criminality. There is a growing impetus to use AI systems in these areas. Indeed, AI systems are already being used for personnel selection 52 , 53 , including automated analyses of applicants’ social-media posts 54 , 55 , and technologies for predicting legal outcomes are under active development 56 , 57 , 58 . Rather than advocating these use cases of AI, which are inherently problematic 59 , the sole objective of this analysis is to examine the extent to which the decisions of language models, when they are used in such contexts, are impacted by dialect.

First, we examined decisions about employability. Using matched guise probing, we asked the language models to match occupations to the speakers who uttered the AAE or SAE texts and computed scores indicating whether an occupation is associated more with speakers of AAE (positive scores) or speakers of SAE (negative scores; Methods , ‘Employability analysis’). The average score of the occupations was negative ( m  = –0.046,  s  = 0.053), the difference from zero being statistically significant (one-sample, one-sided t -test, t (83) = −7.9, P  < 0.001). This trend held for all language models individually (Extended Data Table 3 ). Thus, if a speaker exhibited features of AAE, the language models were less likely to associate them with any job. Furthermore, we observed that for all language models, the occupations that had the lowest association with AAE require a university degree (such as psychologist, professor and economist), but this is not the case for the occupations that had the highest association with AAE (for example, cook, soldier and guard; Fig. 3a ). Also, many occupations strongly associated with AAE are related to music and entertainment more generally (singer, musician and comedian), which is in line with a pervasive stereotype about African Americans 60 . To probe these observations more systematically, we tested for a correlation between the prestige of the occupations and the propensity of the language models to match them to AAE ( Methods ). Using a linear regression, we found that the association with AAE predicted the occupational prestige (Fig. 3b ; β  = −7.8, R 2 = 0.193, F (1, 63) = 15.1, P  < 0.001). This trend held for all language models individually (Extended Data Fig. 2 and Extended Data Table 4 ), albeit in a less pronounced way for GPT3.5, which had a particularly strong association of AAE with occupations in music and entertainment.
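
As a hedged illustration of this analysis, the sketch below fits an ordinary least-squares regression of occupational prestige on each occupation's AAE-association score; the input arrays are placeholders for the real per-occupation values, and this is not the authors' analysis code.

# OLS regression of occupational prestige on AAE-association scores.
import numpy as np

def prestige_regression(aae_association, prestige):
    x = np.asarray(aae_association, dtype=float)
    y = np.asarray(prestige, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # prestige ~ AAE association
    y_hat = slope * x + intercept
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared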

Figure 3: a, Association of different occupations with AAE or SAE. Positive values indicate a stronger association with AAE and negative values indicate a stronger association with SAE. The bottom five occupations (those associated most strongly with SAE) mostly require a university degree, but this is not the case for the top five (those associated most strongly with AAE). b, Prestige of occupations that language models associate with AAE (positive values) or SAE (negative values). The shaded area shows a 95% confidence band around the regression line. The association with AAE or SAE predicts the occupational prestige. Results for individual language models are provided in Extended Data Fig. 2. c, Relative increase in the number of convictions and death sentences for AAE versus SAE. Error bars represent the standard error across different model versions, settings and prompts (n = 24 for GPT2, n = 12 for RoBERTa, n = 24 for T5, n = 6 for GPT3.5 and n = 6 for GPT4). In cases of small sample size (n ≤ 10 for GPT3.5 and GPT4), we plotted the individual results as overlaid dots. T5 does not contain the tokens ‘acquitted’ or ‘convicted’ in its vocabulary and is therefore excluded from the conviction analysis. Detrimental judicial decisions systematically go up for speakers of AAE compared with speakers of SAE.

We then examined decisions about criminality. We used matched guise probing for two experiments in which we presented the language models with hypothetical trials where the only evidence was a text uttered by the defendant in either AAE or SAE. We then measured the probability that the language models assigned to potential judicial outcomes in these trials and counted how often each of the judicial outcomes was preferred for AAE and SAE ( Methods , ‘Criminality analysis’). In the first experiment, we told the language models that a person is accused of an unspecified crime and asked whether the models will convict or acquit the person solely on the basis of the AAE or SAE text. Overall, we found that the rate of convictions was greater for AAE ( r  = 68.7%) than SAE ( r  = 62.1%; Fig. 3c , left). A chi-squared test found a strong effect ( χ 2 (1,  N  = 96) = 184.7,  P  < 0.001), which held for all language models individually (Extended Data Table 5 ). In the second experiment, we specifically told the language models that the person committed first-degree murder and asked whether the models will sentence the person to life or death on the basis of the AAE or SAE text. The overall rate of death sentences was greater for AAE ( r  = 27.7%) than for SAE ( r  = 22.8%; Fig. 3c , right). A chi-squared test found a strong effect ( χ 2 (1,  N  = 144) = 425.4,  P  < 0.001), which held for all language models individually except for T5 (Extended Data Table 6 ). In the Supplementary Information , we show that this deviation was caused by the base T5 version, and that the larger T5 versions follow the general pattern (Supplementary Table 10 ).
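
As a rough illustration of the tests reported here, the sketch below builds a two-by-two table of judicial outcome by dialect and runs a chi-squared test with SciPy; the counts are placeholders rather than the study's data.

# Chi-squared test on a 2x2 contingency table of judicial outcome by dialect.
from scipy.stats import chi2_contingency

def dialect_outcome_test(aae_convicted, aae_acquitted, sae_convicted, sae_acquitted):
    table = [[aae_convicted, aae_acquitted],
             [sae_convicted, sae_acquitted]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    aae_rate = aae_convicted / (aae_convicted + aae_acquitted)
    sae_rate = sae_convicted / (sae_convicted + sae_acquitted)
    return {"chi2": chi2, "p": p_value, "AAE_rate": aae_rate, "SAE_rate": sae_rate}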

In further experiments ( Supplementary Information , ‘Intelligence analysis’), we used matched guise probing to examine decisions about intelligence, and found that all the language models consistently judge speakers of AAE to have a lower IQ than speakers of SAE (Supplementary Figs. 14 and 15 and Supplementary Tables 17 – 19 ).

Resolvability of dialect prejudice

We wanted to know whether the dialect prejudice we observed is resolved by current practices of bias mitigation, such as increasing the size of the language model or including HF in training. It has been shown that larger language models work better with dialects 21 and can have less racial bias 61 . Therefore, the first method we examined was scaling, that is, increasing the model size ( Methods ). We found evidence of a clear trend (Extended Data Tables 7 and 8 ): larger language models are indeed better at processing AAE (Fig. 4a , left), but they are not less prejudiced against speakers of it. In fact, larger models showed more covert prejudice than smaller models (Fig. 4a , right). By contrast, larger models showed less overt prejudice against African Americans (Fig. 4a , right). Thus, increasing scale does make models better at processing AAE and at avoiding prejudice against overt mentions of African Americans, but it makes them more linguistically prejudiced.

Fig. 4

a , Language modelling perplexity and stereotype strength on AAE text as a function of model size. Perplexity is a measure of how successful a language model is at processing a particular text; a lower result is better. For language models for which perplexity is not well-defined (RoBERTa and T5), we computed pseudo-perplexity instead (dotted line). Error bars represent the standard error across different models of a size class and AAE or SAE texts ( n  = 9,057 for small, n  = 6,038 for medium, n  = 15,095 for large and n  = 3,019 for very large). For covert stereotypes, error bars represent the standard error across different models of a size class, settings and prompts ( n  = 54 for small, n  = 36 for medium, n  = 90 for large and n  = 18 for very large). For overt stereotypes, error bars represent the standard error across different models of a size class and prompts ( n  = 27 for small, n  = 18 for medium, n  = 45 for large and n  = 9 for very large). Although larger language models are better at processing AAE (left), they are not less prejudiced against speakers of it. Indeed, larger models show more covert prejudice than smaller models (right). By contrast, larger models show less overt prejudice against African Americans (right). In other words, increasing scale does make models better at processing AAE and at avoiding prejudice against overt mentions of African Americans, but it makes them more linguistically prejudiced. b , Change in stereotype strength and favourability as a result of training with HF for covert and overt stereotypes. Error bars represent the standard error across different prompts ( n  = 9). HF weakens (left) and improves (right) overt stereotypes but not covert stereotypes. c , Top overt and covert stereotypes about African Americans in GPT3, trained without HF, and GPT3.5, trained with HF. Colour coding as positive (green) and negative (red) is based on ref. 34 . The overt stereotypes get substantially more positive as a result of HF training in GPT3.5, but there is no visible change in favourability for the covert stereotypes.

As a second potential way to resolve dialect prejudice in language models, we examined training with HF 49 , 62 . Specifically, we compared GPT3.5 (ref. 49 ) with GPT3 (ref. 63 ), its predecessor that was trained without using HF ( Methods ). Looking at the top adjectives associated overtly and covertly with African Americans by the two language models, we found that HF resulted in more-positive overt associations but had no clear qualitative effect on the covert associations (Fig. 4c ). This observation was confirmed by quantitative analyses: the inclusion of HF resulted in significantly weaker (no HF, m  = 0.135,  s  = 0.142; HF, m  = −0.119,  s  = 0.234;  t (16) = 2.6,  P  < 0.05) and more favourable (no HF, m  = 0.221,  s  = 0.399; HF, m  = 1.047,  s  = 0.387;  t (16) = −6.4,  P  < 0.001) overt stereotypes but produced no significant difference in the strength (no HF, m  = 0.153,  s  = 0.049; HF, m  = 0.187,  s  = 0.066;  t (16) = −1.2, P  = 0.3) or unfavourability (no HF, m  = −1.146, s  = 0.580; HF, m = −1.029, s  = 0.196; t (16) = −0.5, P  = 0.6) of covert stereotypes (Fig. 4b ). Thus, HF training weakens and ameliorates the overt stereotypes but has no clear effect on the covert stereotypes; in other words, it obscures the racist attitudes on the surface, but more subtle forms of racism, such as dialect prejudice, remain unaffected. This finding is underscored by the fact that the discrepancy between overt and covert stereotypes about African Americans is most pronounced for the two examined language models trained with human feedback (GPT3.5 and GPT4; see ‘Covert stereotypes in language models’). Furthermore, this finding again shows that there is a fundamental difference between overt and covert stereotypes in language models, and that mitigating the overt stereotypes does not automatically translate to mitigated covert stereotypes.
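The comparisons in this paragraph are independent two-sample t-tests over per-prompt scores (n = 9 prompts per model, giving 16 degrees of freedom). A minimal illustration with invented favourability scores:

```python
# Two-sample t-test comparing toy per-prompt overt favourability scores
# for a model trained without HF versus one trained with HF.
import numpy as np
from scipy import stats

fav_no_hf = np.array([0.10, 0.30, 0.20, 0.40, 0.15, 0.25, 0.30, 0.20, 0.09])
fav_hf = np.array([0.90, 1.20, 1.00, 1.10, 0.95, 1.05, 1.15, 1.00, 1.07])
t, p = stats.ttest_ind(fav_no_hf, fav_hf)   # df = 9 + 9 - 2 = 16
print(t, p)
```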

To sum up, neither scaling nor training with HF as applied today resolves the dialect prejudice. The fact that these two methods effectively mitigate racial performance disparities and overt racial stereotypes in language models indicates that this form of covert racism constitutes a different problem that is not addressed by current approaches for improving and aligning language models.

The key finding of this article is that language models maintain a form of covert racial prejudice against African Americans that is triggered by dialect features alone. In our experiments, we avoided overt mentions of race but drew from the racialized meanings of a stigmatized dialect, and could still find historically racist associations with African Americans. The implicit nature of this prejudice, that is, the fact it is about something that is not explicitly expressed in the text, makes it fundamentally different from the overt racial prejudice that has been the focus of previous research. Strikingly, the language models’ covert and overt racial prejudices are often in contradiction with each other, especially for the most recent language models that have been trained with HF (GPT3.5 and GPT4). These two language models obscure the racism, overtly associating African Americans with exclusively positive attributes (such as ‘brilliant’), but our results show that they covertly associate African Americans with exclusively negative attributes (such as ‘lazy’).

We argue that this paradoxical relation between the language models’ covert and overt racial prejudices manifests the inconsistent racial attitudes present in the contemporary society of the United States 8 , 64 . In the Jim Crow era, stereotypes about African Americans were overtly racist, but the normative climate after the civil rights movement made expressing explicitly racist views distasteful. As a result, racism acquired a covert character and continued to exist on a more subtle level. Thus, most white people nowadays report positive attitudes towards African Americans in surveys but perpetuate racial inequalities through their unconscious behaviour, such as their residential choices 65 . It has been shown that negative stereotypes persist, even if they are superficially rejected 66 , 67 . This ambivalence is reflected by the language models we analysed, which are overtly non-racist but covertly exhibit archaic stereotypes about African Americans, showing that they reproduce a colour-blind racist ideology. Crucially, the civil rights movement is generally seen as the period during which racism shifted from overt to covert 68 , 69 , and this is mirrored by our results: all the language models overtly agree the most with human stereotypes from after the civil rights movement, but covertly agree the most with human stereotypes from before the civil rights movement.

Our findings raise the question of how dialect prejudice got into the language models. Language models are pretrained on web-scraped corpora such as WebText 46 , C4 (ref. 48 ) and the Pile 70 , which encode raciolinguistic stereotypes about AAE. A drastic example of this is the use of ‘mock ebonics’ to parody speakers of AAE 71 . Crucially, a growing body of evidence indicates that language models pick up prejudices present in the pretraining corpus 72 , 73 , 74 , 75 , which would explain how they become prejudiced against speakers of AAE, and why they show varying levels of dialect prejudice as a function of the pretraining corpus. However, the web also abounds with overt racism against African Americans 76 , 77 , so we wondered why the language models exhibit much less overt than covert racial prejudice. We argue that the reason for this is that the existence of overt racism is generally known to people 32 , which is not the case for covert racism 69 . Crucially, this also holds for the field of AI. The typical pipeline of training language models includes steps such as data filtering 48 and, more recently, HF training 62 that remove overt racial prejudice. As a result, much of the overt racism on the web does not end up in the language models. However, there are currently no measures in place to curtail covert racial prejudice when training language models. For example, common datasets for HF training 62 , 78 do not include examples that would train the language models to treat speakers of AAE and SAE equally. As a result, the covert racism encoded in the training data can make its way into the language models in an unhindered fashion. It is worth mentioning that the lack of awareness of covert racism also manifests during evaluation, where it is common to test language models for overt racism but not for covert racism 21 , 63 , 79 , 80 .

As well as the representational harms, by which we mean the pernicious representation of AAE speakers, we also found evidence for substantial allocational harms. This refers to the inequitable allocation of resources to AAE speakers 81 (Barocas et al., unpublished observations), and adds to known cases of language technology putting speakers of AAE at a disadvantage by performing worse on AAE 82 , 83 , 84 , 85 , 86 , 87 , 88 , misclassifying AAE as hate speech 81 , 89 , 90 , 91 or treating AAE as incorrect English 83 , 85 , 92 . All the language models are more likely to assign low-prestige jobs to speakers of AAE than to speakers of SAE, and are more likely to convict speakers of AAE of a crime, and to sentence speakers of AAE to death. Although the details of our tasks are constructed, the findings reveal real and urgent concerns because business and jurisdiction are areas for which AI systems involving language models are currently being developed or deployed. As a consequence, the dialect prejudice we uncovered might already be affecting AI decisions today, for example when a language model is used in application-screening systems to process background information, which might include social-media text. Worryingly, we also observe that larger language models and language models trained with HF exhibit stronger covert, but weaker overt, prejudice. Against the backdrop of continually growing language models and the increasingly widespread adoption of HF training, this has two risks: first, that language models, unbeknownst to developers and users, reach ever-increasing levels of covert prejudice; and second, that developers and users mistake ever-decreasing levels of overt prejudice (the only kind of prejudice currently tested for) for a sign that racism in language models has been solved. There is therefore a realistic possibility that the allocational harms caused by dialect prejudice in language models will increase further in the future, perpetuating the racial discrimination experienced by generations of African Americans.

Matched guise probing examines how strongly a language model associates certain tokens, such as personality traits, with AAE compared with SAE. AAE can be viewed as the treatment condition, whereas SAE functions as the control condition. We start by explaining the basic experimental unit of matched guise probing: measuring how a language model associates certain tokens with an individual text in AAE or SAE. Based on this, we introduce two different settings for matched guise probing (meaning-matched and non-meaning-matched), which are both inspired by the matched guise technique used in sociolinguistics 36 , 37 , 93 , 94 and provide complementary views on the attitudes a language model has about a dialect.

The basic experimental unit of matched guise probing is as follows. Let θ be a language model, t be a text in AAE or SAE, and x be a token of interest, typically a personality trait such as ‘intelligent’. We embed the text in a prompt v , for example v ( t ) = ‘a person who says t tends to be’, and compute P ( x ∣ v ( t );  θ ), which is the probability that θ assigns to x after processing v ( t ). We calculate P ( x ∣ v ( t );  θ ) for equally sized sets T a of AAE texts and T s of SAE texts, comparing various tokens from a set X as possible continuations. It has been shown that P ( x ∣ v ( t );  θ ) can be affected by the precise wording of v , so small modifications of v can have an unpredictable effect on the predictions made by the language model 21 , 95 , 96 . To account for this fact, we consider a set V containing several prompts ( Supplementary Information ). For all experiments, we have provided detailed analyses of variation across prompts in the  Supplementary Information .
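As an illustration of this basic unit, the sketch below computes P(x ∣ v(t); θ) for a decoder-only model using the Hugging Face transformers library. The choice of GPT-2, the prompt wording and the helper name are illustrative assumptions rather than the exact implementation used in the study.

```python
# Sketch of the basic matched guise probing unit for a decoder-only model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_probability(text: str, continuation: str) -> float:
    """P(x | v(t); theta) for a (first-subword) continuation token x."""
    prompt = f'A person who says "{text}" tends to be'   # one possible prompt v
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]                 # next-token logits
    probs = torch.softmax(logits, dim=-1)
    x_id = tok(" " + continuation, add_special_tokens=False).input_ids[0]
    return probs[x_id].item()

p_aae = token_probability("I be so happy when I wake up from a bad dream cus they be feelin too real", "intelligent")
p_sae = token_probability("I am so happy when I wake up from a bad dream because they feel too real", "intelligent")
```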

We conducted matched guise probing in two settings. In the first setting, the texts in T a and T s formed pairs expressing the same underlying meaning, that is, the i -th text in T a (for example, ‘I be so happy when I wake up from a bad dream cus they be feelin too real’) matches the i -th text in T s (for example, ‘I am so happy when I wake up from a bad dream because they feel too real’). For this setting, we used the dataset from ref. 87 , which contains 2,019 AAE tweets together with their SAE translations. In the second setting, the texts in T a and T s did not form pairs, so they were independent texts in AAE and SAE. For this setting, we sampled 2,000 AAE and SAE tweets from the dataset in ref. 83 and used tweets strongly aligned with African Americans for AAE and tweets strongly aligned with white people for SAE ( Supplementary Information (‘Analysis of non-meaning-matched texts’), Supplementary Fig. 1 and Supplementary Table 3 ). In the  Supplementary Information , we include examples of AAE and SAE texts for both settings (Supplementary Tables 1 and 2 ). Tweets are well suited for matched guise probing because they are a rich source of dialectal variation 97 , 98 , 99 , especially for AAE 100 , 101 , 102 , but matched guise probing can be applied to any type of text. Although we do not consider it here, matched guise probing can in principle also be applied to speech-based models, with the potential advantage that dialectal variation on the phonetic level could be captured more directly, which would make it possible to study dialect prejudice specific to regional variants of AAE 23 . However, note that a great deal of phonetic variation is reflected orthographically in social-media texts 101 .

It is important to analyse both meaning-matched and non-meaning-matched settings because they capture different aspects of the attitudes a language model has about speakers of AAE. Controlling for the underlying meaning makes it possible to uncover differences in the attitudes of the language model that are solely due to grammatical and lexical features of AAE. However, it is known that various properties other than linguistic features correlate with dialect, such as topics 45 , and these might also influence the attitudes of the language model. Sidelining such properties bears the risk of underestimating the harms that dialect prejudice causes for speakers of AAE in the real world. For example, in a scenario in which a language model is used in the context of automated personnel selection to screen applicants’ social-media posts, the texts of two competing applicants typically differ in content and do not come in pairs expressing the same meaning. The relative advantages of using meaning-matched or non-meaning-matched data for matched guise probing are conceptually similar to the relative advantages of using the same or different speakers for the matched guise technique: more control in the former versus more naturalness in the latter setting 93 , 94 . Because the results obtained in both settings were consistent overall for all experiments, we aggregated them in the main article, but we analysed differences in detail in the  Supplementary Information .

We apply matched guise probing to five language models: RoBERTa 47 , which is an encoder-only language model; GPT2 (ref. 46 ), GPT3.5 (ref. 49 ) and GPT4 (ref. 50 ), which are decoder-only language models; and T5 (ref. 48 ), which is an encoder–decoder language model. For each language model, we examined one or more model versions: GPT2 (base), GPT2 (medium), GPT2 (large), GPT2 (xl), RoBERTa (base), RoBERTa (large), T5 (small), T5 (base), T5 (large), T5 (3b), GPT3.5 (text-davinci-003) and GPT4 (0613). Where we used several model versions per language model (GPT2, RoBERTa and T5), the model versions all had the same architecture and were trained on the same data but differed in their size. Furthermore, we note that GPT3.5 and GPT4 are the only language models examined in this paper that were trained with HF, specifically reinforcement learning from human feedback 103 . When it is clear from the context what is meant, or when the distinction does not matter, we use the term ‘language models’, or sometimes ‘models’, in a more general way that includes individual model versions.

Regarding matched guise probing, the exact method for computing P ( x ∣ v ( t );  θ ) varies across language models and is detailed in the  Supplementary Information . For GPT4, for which computing P ( x ∣ v ( t );  θ ) for all tokens of interest was often not possible owing to restrictions imposed by the OpenAI application programming interface (API), we used a slightly modified method for some of the experiments, and this is also discussed in the  Supplementary Information . Similarly, some of the experiments could not be done for all language models because of model-specific constraints, which we highlight below. We note that there was at most one language model per experiment for which this was the case.

Covert-stereotype analysis

In the covert-stereotype analysis, the tokens x whose probabilities are measured for matched guise probing are trait adjectives from the Princeton Trilogy 29 , 30 , 31 , 34 , such as ‘aggressive’, ‘intelligent’ and ‘quiet’. We provide details about these adjectives in the  Supplementary Information . In the Princeton Trilogy, the adjectives are provided to participants in the form of a list, and participants are asked to select from the list the five adjectives that best characterize a given ethnic group, such as African Americans. The studies that we compare in this paper, which are the original Princeton Trilogy studies 29 , 30 , 31 and a more recent reinstallment 34 , all follow this general set-up and observe a gradual improvement of the expressed stereotypes about African Americans over time, but the exact interpretation of this finding is disputed 32 . Here, we used the adjectives from the Princeton Trilogy in the context of matched guise probing.

Specifically, we first computed P(x ∣ v(t); θ) for all adjectives, for both the AAE texts and the SAE texts. The method for aggregating the probabilities P(x ∣ v(t); θ) into association scores between an adjective x and AAE varies for the two settings of matched guise probing. Let t_a^i be the i-th AAE text in T_a and t_s^i be the i-th SAE text in T_s. In the meaning-matched setting, in which t_a^i and t_s^i express the same meaning, we computed the prompt-level association score for an adjective x as

$$q(x; v, \theta) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{P\left(x \mid v\left(t_{\mathrm{a}}^{i}\right); \theta\right)}{P\left(x \mid v\left(t_{\mathrm{s}}^{i}\right); \theta\right)},$$

where n = |T_a| = |T_s|. Thus, we measure for each pair of AAE and SAE texts the log ratio of the probability assigned to x following the AAE text and the probability assigned to x following the SAE text, and then average the log ratios of the probabilities across all pairs. In the non-meaning-matched setting, we computed the prompt-level association score for an adjective x as

$$q(x; v, \theta) = \log \frac{\frac{1}{n}\sum_{i=1}^{n} P\left(x \mid v\left(t_{\mathrm{a}}^{i}\right); \theta\right)}{\frac{1}{n}\sum_{i=1}^{n} P\left(x \mid v\left(t_{\mathrm{s}}^{i}\right); \theta\right)},$$

where again n = |T_a| = |T_s|. In other words, we first compute the average probability assigned to a certain adjective x following all AAE texts and the average probability assigned to x following all SAE texts, and then measure the log ratio of these average probabilities. The interpretation of q(x; v, θ) is identical in both settings: q(x; v, θ) > 0 means that for a certain prompt v, the language model θ associates the adjective x more strongly with AAE than with SAE, and q(x; v, θ) < 0 means that for a certain prompt v, the language model θ associates the adjective x more strongly with SAE than with AAE. In the Supplementary Information (‘Calibration’), we show that q(x; v, θ) is calibrated 104 , meaning that it does not depend on the prior probability that θ assigns to x in a neutral context.
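The two aggregation schemes can be illustrated as follows, assuming the probabilities P(x ∣ v(t); θ) have already been computed for each text; the arrays below are toy values.

```python
# q(x; v, theta) under the two settings, given per-text probabilities for x.
import numpy as np

p_aae = np.array([0.012, 0.008, 0.015])   # P(x | v(t_a^i); theta)
p_sae = np.array([0.010, 0.011, 0.013])   # P(x | v(t_s^i); theta)

# Meaning-matched: average the per-pair log ratios.
q_matched = np.mean(np.log(p_aae / p_sae))

# Non-meaning-matched: log ratio of the average probabilities.
q_unmatched = np.log(p_aae.mean() / p_sae.mean())
```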

The prompt-level association scores q ( x ;  v ,  θ ) are the basis for further analyses. We start by averaging q ( x ;  v ,  θ ) across model versions, prompts and settings, and this allows us to rank all adjectives according to their overall association with AAE for individual language models (Fig. 2a ). In this and the following adjective analyses, we focus on the five adjectives that exhibit the highest association with AAE, making it possible to consistently compare the language models with the results from the Princeton Trilogy studies, most of which do not report the full ranking of all adjectives. Results for individual model versions are provided in the  Supplementary Information , where we also analyse variation across settings and prompts (Supplementary Fig. 2 and Supplementary Table 4 ).

Next, we wanted to measure the agreement between language models and humans through time. To do so, we considered the five adjectives most strongly associated with African Americans for each study and evaluated how highly these adjectives are ranked by the language models. Specifically, let R_l = [x_1, …, x_|X|] be the adjective ranking generated by a language model and R_h^5 = [x_1, …, x_5] be the ranking of the top five adjectives generated by the human participants in one of the Princeton Trilogy studies. A typical measure to evaluate how highly the adjectives from R_h^5 are ranked within R_l is average precision, AP 51 . However, AP does not take the internal ranking of the adjectives in R_h^5 into account, which is not ideal for our purposes; for example, AP does not distinguish whether the top-ranked adjective for humans is on the first or on the fifth rank for a language model. To remedy this, we computed the mean average precision, MAP, for different subsets of R_h^5,

$$\mathrm{MAP} = \frac{1}{5} \sum_{i=1}^{5} \mathrm{AP}\left(R_{l}, R_{h}^{i}\right),$$

where R_h^i denotes the top i adjectives from the human ranking. MAP = 1 if, and only if, the top five adjectives from R_h^5 have an exact one-to-one correspondence with the top five adjectives from R_l, so, unlike AP, it takes the internal ranking of the adjectives into account. We computed an individual agreement score for each language model and prompt, so we average the q(x; v, θ) association scores for all model versions of a language model (GPT2, for example) and the two settings (meaning-matched and non-meaning-matched) to generate R_l. Because the OpenAI API for GPT4 does not give access to the probabilities for all adjectives, we excluded GPT4 from this analysis. Results are presented in Fig. 2b and Extended Data Table 1 . In the Supplementary Information (‘Agreement analysis’), we analyse variation across model versions, settings and prompts (Supplementary Figs. 3 – 5 ).
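A minimal sketch of this agreement score, with toy rankings standing in for the actual model and human rankings:

```python
# MAP over the top-1, ..., top-5 human adjectives, so that the internal
# ordering of the human ranking matters (toy rankings below).
def average_precision(model_ranking, relevant):
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, item in enumerate(model_ranking, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

def mean_average_precision(model_ranking, human_top5):
    return sum(
        average_precision(model_ranking, human_top5[:i]) for i in range(1, 6)
    ) / 5

model_ranking = ["loud", "musical", "lazy", "religious", "aggressive", "kind"]
human_top5 = ["musical", "lazy", "religious", "loud", "aggressive"]
print(mean_average_precision(model_ranking, human_top5))
```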

To analyse the favourability of the stereotypes about African Americans, we drew from crowd-sourced favourability ratings collected previously 34 for the adjectives from the Princeton Trilogy that range between −2 (‘very unfavourable’, meaning very negative) and 2 (‘very favourable’, meaning very positive). For example, the favourability rating of ‘cruel’ is −1.81 and the favourability rating of ‘brilliant’ is 1.86. We computed the average favourability of the top five adjectives, weighting the favourability ratings of individual adjectives by their association scores with AAE and African Americans. More formally, let R^5 = [x_1, …, x_5] be the ranking of the top five adjectives generated by either a language model or humans. Furthermore, let f(x) be the favourability rating of adjective x as reported in ref. 34 , and let q(x) be the overall association score of adjective x with AAE or African Americans that is used to generate R^5. For the Princeton Trilogy studies, q(x) is the percentage of participants who have assigned x to African Americans. For language models, q(x) is the average value of q(x; v, θ). We then computed the weighted average favourability, F, of the top five adjectives as

$$F = \frac{\sum_{i=1}^{5} q(x_{i})\, f(x_{i})}{\sum_{i=1}^{5} q(x_{i})}.$$
As a result of the weighting, the top-ranked adjective contributed more to the average than the second-ranked adjective, and so on. Results are presented in Extended Data Fig. 1 . To check for consistency, we also computed the average favourability of the top five adjectives without weighting, which yields similar results (Supplementary Fig. 6) .
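For illustration, the weighted favourability can be computed as follows; the association scores and favourability ratings are invented stand-ins for the real values from ref. 34.

```python
# Weighted average favourability F of the top five adjectives,
# weighting each favourability rating f(x) by the association score q(x).
top5 = [
    ("musical",    0.21,  0.9),
    ("lazy",       0.18, -1.4),
    ("loud",       0.15, -0.7),
    ("religious",  0.12,  0.4),
    ("aggressive", 0.10, -1.1),
]  # (adjective, q(x), f(x))
F = sum(q * f for _, q, f in top5) / sum(q for _, q, _ in top5)
print(F)
```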

Overt-stereotype analysis

The overt-stereotype analysis closely followed the methodology of the covert-stereotype analysis, with the difference being that instead of providing the language models with AAE and SAE texts, we provided them with overt descriptions of race (specifically, ‘Black’/‘black’ and ‘White’/‘white’). This methodological difference is also reflected by a different set of prompts ( Supplementary Information ). As a result, the experimental set-up is very similar to existing studies on overt racial bias in language models 4 , 7 . All other aspects of the analysis (such as computing adjective association scores) were identical to the analysis for covert stereotypes. This also holds for GPT4, for which we again could not conduct the agreement analysis.

We again present average results for the five language models in the main article. Results broken down for individual model versions are provided in the  Supplementary Information , where we also analyse variation across prompts (Supplementary Fig. 8 and Supplementary Table 5 ).

Employability analysis

The general set-up of the employability analysis was identical to the stereotype analyses: we fed text written in either AAE or SAE, embedded in prompts, into the language models and analysed the probabilities that they assigned to different continuation tokens. However, instead of trait adjectives, we considered occupations for X and also used a different set of prompts ( Supplementary Information ). We created a list of occupations, drawing from previously published lists 6 , 76 , 105 , 106 , 107 . We provided details about these occupations in the  Supplementary Information . We then computed association scores q ( x ;  v ,  θ ) between individual occupations x and AAE, following the same methodology as for computing adjective association scores, and ranked the occupations according to q ( x ;  v ,  θ ) for the language models. To probe the prestige associated with the occupations, we drew from a dataset of occupational prestige 105 that is based on the 2012 US General Social Survey and measures prestige on a scale from 1 (low prestige) to 9 (high prestige). For GPT4, we could not conduct the parts of the analysis that require scores for all occupations.

We again present average results for the five language models in the main article. Results for individual model versions are provided in the  Supplementary Information , where we also analyse variation across settings and prompts (Supplementary Tables 6 – 8 ).

Criminality analysis

The set-up of the criminality analysis is different from the previous experiments in that we did not compute aggregate association scores between certain tokens (such as trait adjectives) and AAE but instead asked the language models to make discrete decisions for each AAE and SAE text. More specifically, we simulated trials in which the language models were prompted to use AAE or SAE texts as evidence to make a judicial decision. We then aggregated the judicial decisions into summary statistics.

We conducted two experiments. In the first experiment, the language models were asked to determine whether a person accused of committing an unspecified crime should be acquitted or convicted. The only evidence provided to the language models was a statement made by the defendant, which was an AAE or SAE text. In the second experiment, the language models were asked to determine whether a person who committed first-degree murder should be sentenced to life or death. Similarly to the first (general conviction) experiment, the only evidence provided to the language models was a statement made by the defendant, which was an AAE or SAE text. Note that the AAE and SAE texts were the same texts as in the other experiments and did not come from a judicial context. Rather than testing how well language models could perform the tasks of predicting acquittal or conviction and life penalty or death penalty (an application of AI that we do not support), we were interested to see to what extent the decisions of the language models, made in the absence of any real evidence, were impacted by dialect. Although providing the language models with extra evidence as well as the AAE and SAE texts would have made the experiments more similar to real trials, it would have confounded the effect that dialect has on its own (the key effect of interest), so we did not consider this alternative set-up here. We focused on convictions and death penalties specifically because these are the two areas of the criminal justice system for which racial disparities have been described in the most robust and indisputable way: African Americans represent about 12% of the adult population of the United States, but they represent 33% of inmates 108 and more than 41% of people on death row 109 .

Methodologically, we used prompts that asked the language models to make a judicial decision ( Supplementary Information ). For a specific text, t , which is in AAE or SAE, we computed P ( x ∣ v ( t );  θ ) for the tokens x that correspond to the judicial outcomes of interest (‘acquitted’ or ‘convicted’, and ‘life’ or ‘death’). T5 does not contain the tokens ‘acquitted’ and ‘convicted’ in its vocabulary, so it was excluded from the conviction analysis. Because the language models might assign different prior probabilities to the outcome tokens, we calibrated them using their probabilities in a neutral context following v , meaning without text t 104 . Whichever outcome had the higher calibrated probability was counted as the decision. We aggregated the detrimental decisions (convictions and death penalties) and compared their rates (percentages) between AAE and SAE texts. An alternative approach would have been to generate the judicial decision by sampling from the language models, which would have allowed us to induce the language models to generate justifications of their decisions. However, this approach has three disadvantages: first, encoder-only language models such as RoBERTa do not lend themselves to text generation; second, it would have been necessary to apply jail-breaking for some of the language models, which can have unpredictable effects, especially in the context of socially sensitive tasks; and third, model-generated justifications are frequently not aligned with actual model behaviours 110 .
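A sketch of this decision rule for a single text is given below. It assumes a helper prob(prompt, token) that returns P(token ∣ prompt; θ) for the model under study, and it implements the calibration as division by the outcome token's probability in the neutral context, which is one natural reading of the procedure described above; the prompt wording is illustrative.

```python
# Calibrated conviction decision for one AAE or SAE text.
def calibrated_decision(text: str, prob) -> str:
    prompt_with_text = f'He is accused of committing a crime. He says: "{text}". He should be'
    prompt_neutral = 'He is accused of committing a crime. He should be'
    scores = {}
    for outcome in ("acquitted", "convicted"):
        # Divide out the prior probability of the outcome token in the
        # neutral context (the prompt without the text t).
        scores[outcome] = prob(prompt_with_text, outcome) / prob(prompt_neutral, outcome)
    return max(scores, key=scores.get)
```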

We again present average results on the level of language models in the main article. Results for individual model versions are provided in the  Supplementary Information , where we also analyse variation across settings and prompts (Supplementary Figs. 9 and 10 and Supplementary Tables 9 – 12 ).

Scaling analysis

In the scaling analysis, we examined whether increasing the model size alleviated the dialect prejudice. Because the content of the covert stereotypes is quite consistent and does not vary substantially between models with different sizes, we instead analysed the strength with which the language models maintain these stereotypes. We split the model versions of all language models into four groups according to their size using the thresholds of 1.5 × 10^8, 3.5 × 10^8 and 1.0 × 10^10 (Extended Data Table 7 ).

To evaluate the familiarity of the models with AAE, we measured their perplexity on the datasets used for the two evaluation settings 83 , 87 . Perplexity is defined as the exponentiated average negative log-likelihood of a sequence of tokens 111 , with lower values indicating higher familiarity. Perplexity requires the language models to assign probabilities to full sequences of tokens, which is only the case for GPT2 and GPT3.5. For RoBERTa and T5, we resorted to pseudo-perplexity 112 as the measure of familiarity. Results are only comparable across language models with the same familiarity measure. We excluded GPT4 from this analysis because it is not possible to compute perplexity using the OpenAI API.
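For a causal language model, perplexity follows directly from the mean token-level negative log-likelihood; the sketch below uses GPT-2 via transformers and is illustrative rather than the exact evaluation code (pseudo-perplexity for RoBERTa and T5 requires a separate masked-scoring procedure).

```python
# Perplexity of a text under a causal language model (GPT-2 here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean negative log-likelihood.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("I be so happy when I wake up from a bad dream cus they be feelin too real"))
```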

To evaluate the stereotype strength, we focused on the stereotypes about African Americans reported in ref. 29 , which the language models’ covert stereotypes agree with most strongly. We split the set of adjectives X into two subsets: the set of stereotypical adjectives in ref. 29 , X_s, and the set of non-stereotypical adjectives, X_n = X \ X_s. For each model with a specific size, we then computed the average value of q(x; v, θ) for all adjectives in X_s, which we denote as q_s(θ), and the average value of q(x; v, θ) for all adjectives in X_n, which we denote as q_n(θ). The stereotype strength of a model θ, or more specifically the strength of the stereotypes about African Americans reported in ref. 29 , can then be computed as

$$\delta(\theta) = q_{\mathrm{s}}(\theta) - q_{\mathrm{n}}(\theta).$$

A positive value of δ(θ) means that the model associates the stereotypical adjectives in X_s more strongly with AAE than the non-stereotypical adjectives in X_n, whereas a negative value of δ(θ) indicates anti-stereotypical associations, meaning that the model associates the non-stereotypical adjectives in X_n more strongly with AAE than the stereotypical adjectives in X_s. For the overt stereotypes, we used the same split of adjectives into X_s and X_n because we wanted to directly compare the strength with which models of a certain size endorse the stereotypes overtly as opposed to covertly. All other aspects of the experimental set-up are identical to the main analyses of covert and overt stereotypes.
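A minimal sketch of this stereotype-strength measure, with invented association scores and an invented adjective split:

```python
# delta(theta) = mean q over stereotypical adjectives minus mean q over
# non-stereotypical adjectives (toy values; X_s follows ref. 29 in the paper).
q = {"lazy": 0.30, "musical": 0.25, "ignorant": 0.22, "intelligent": -0.15, "quiet": -0.05}
stereotypical = {"lazy", "musical", "ignorant"}        # X_s
non_stereotypical = set(q) - stereotypical             # X_n = X \ X_s

q_s = sum(q[x] for x in stereotypical) / len(stereotypical)
q_n = sum(q[x] for x in non_stereotypical) / len(non_stereotypical)
delta = q_s - q_n   # > 0: stereotypical adjectives more strongly tied to AAE
print(delta)
```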

HF analysis

We compared GPT3.5 (ref. 49 ; text-davinci-003) with GPT3 (ref. 63 ; davinci), its predecessor language model that was trained without HF. Similarly to other studies that compare these two language models 113 , this set-up allowed us to examine the effects of HF training as done for GPT3.5 in isolation. We compared the two language models in terms of favourability and stereotype strength. For favourability, we followed the methodology we used for the overt-stereotype analysis and evaluated the average weighted favourability of the top five adjectives associated with AAE. For stereotype strength, we followed the methodology we used for the scaling analysis and evaluated the average strength of the stereotypes as reported in ref.  29 .

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All the datasets used in this study are publicly available. The dataset released as ref. 87 can be found at https://aclanthology.org/2020.emnlp-main.473/ . The dataset released as ref. 83 can be found at http://slanglab.cs.umass.edu/TwitterAAE/ . The human stereotype scores used for evaluation can be found in the published articles of the Princeton Trilogy studies 29 , 30 , 31 , 34 . The most recent of these articles 34 also contains the human favourability scores for the trait adjectives. The dataset of occupational prestige that we used for the employability analysis can be found in the corresponding paper 105 . The Brown Corpus 114 , which we used for the  Supplementary Information (‘Feature analysis’), can be found at http://www.nltk.org/nltk_data/ . The dataset containing the parallel AAE, Appalachian English and Indian English texts 115 , which we used in the  Supplementary Information (‘Alternative explanations’), can be found at https://huggingface.co/collections/SALT-NLP/value-nlp-666b60a7f76c14551bda4f52 .

Code availability

Our code is written in Python and draws on the Python packages openai and transformers for language-model probing, as well as numpy, pandas, scipy and statsmodels for data analysis. The feature analysis described in the  Supplementary Information also uses the VALUE Python library 88 . Our code is publicly available on GitHub at https://github.com/valentinhofmann/dialect-prejudice .

Zhao, W. et al. WildChat: 1M ChatGPT interaction logs in the wild. In Proc. Twelfth International Conference on Learning Representations (OpenReview.net, 2024).

Zheng, L. et al. LMSYS-Chat-1M: a large-scale real-world LLM conversation dataset. In Proc. Twelfth International Conference on Learning Representations (OpenReview.net, 2024).

Gaebler, J. D., Goel, S., Huq, A. & Tambe, P. Auditing the use of language models to guide hiring decisions. Preprint at https://arxiv.org/abs/2404.03086 (2024).

Sheng, E., Chang, K.-W., Natarajan, P. & Peng, N. The woman worked as a babysitter: on biases in language generation. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (eds Inui. K. et al.) 3407–3412 (Association for Computational Linguistics, 2019).

Nangia, N., Vania, C., Bhalerao, R. & Bowman, S. R. CrowS-Pairs: a challenge dataset for measuring social biases in masked language models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (eds Webber, B. et al.) 1953–1967 (Association for Computational Linguistics, 2020).

Nadeem, M., Bethke, A. & Reddy, S. StereoSet: measuring stereotypical bias in pretrained language models. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (eds Zong, C. et al.) 5356–5371 (Association for Computational Linguistics, 2021).

Cheng, M., Durmus, E. & Jurafsky, D. Marked personas: using natural language prompts to measure stereotypes in language models. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (eds Rogers, A. et al.) 1504–1532 (Association for Computational Linguistics, 2023).

Bonilla-Silva, E. Racism without Racists: Color-Blind Racism and the Persistence of Racial Inequality in America 4th edn (Rowman & Littlefield, 2014).

Golash-Boza, T. A critical and comprehensive sociological theory of race and racism. Sociol. Race Ethn. 2 , 129–141 (2016).

Kasneci, E. et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 103 , 102274 (2023).

Nay, J. J. et al. Large language models as tax attorneys: a case study in legal capabilities emergence. Philos. Trans. R. Soc. A 382 , 20230159 (2024).

Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619 , 357–362 (2023).

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V. & Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv. Neural Inf. Process. Syst. 30 , 4356–4364 (2016).

Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356 , 183–186 (2017).

Basta, C., Costa-jussà, M. R. & Casas, N. Evaluating the underlying gender bias in contextualized word embeddings. In Proc. First Workshop on Gender Bias in Natural Language Processing (eds Costa-jussà, M. R. et al.) 33–39 (Association for Computational Linguistics, 2019).

Kurita, K., Vyas, N., Pareek, A., Black, A. W. & Tsvetkov, Y. Measuring bias in contextualized word representations. In Proc. First Workshop on Gender Bias in Natural Language Processing (eds Costa-jussà, M. R. et al.) 166–172 (Association for Computational Linguistics, 2019).

Abid, A., Farooqi, M. & Zou, J. Persistent anti-muslim bias in large language models. In Proc. 2021 AAAI/ACM Conference on AI, Ethics, and Society (eds Fourcade, M. et al.) 298–306 (Association for Computing Machinery, 2021).

Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 610–623 (Association for Computing Machinery, 2021).

Li, L. & Bamman, D. Gender and representation bias in GPT-3 generated stories. In Proc. Third Workshop on Narrative Understanding (eds Akoury, N. et al.) 48–55 (Association for Computational Linguistics, 2021).

Tamkin, A. et al. Evaluating and mitigating discrimination in language model decisions. Preprint at https://arxiv.org/abs/2312.03689 (2023).

Rae, J. W. et al. Scaling language models: methods, analysis & insights from training Gopher. Preprint at https://arxiv.org/abs/2112.11446 (2021).

Green, L. J. African American English: A Linguistic Introduction (Cambridge Univ. Press, 2002).

King, S. From African American Vernacular English to African American Language: rethinking the study of race and language in African Americans’ speech. Annu. Rev. Linguist. 6 , 285–300 (2020).

Purnell, T., Idsardi, W. & Baugh, J. Perceptual and phonetic experiments on American English dialect identification. J. Lang. Soc. Psychol. 18 , 10–30 (1999).

Massey, D. S. & Lundy, G. Use of Black English and racial discrimination in urban housing markets: new methods and findings. Urban Aff. Rev. 36 , 452–469 (2001).

Dunbar, A., King, S. & Vaughn, C. Dialect on trial: an experimental examination of raciolinguistic ideologies and character judgments. Race Justice https://doi.org/10.1177/21533687241258772 (2024).

Rickford, J. R. & King, S. Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular speakers) in the courtroom and beyond. Language 92 , 948–988 (2016).

Grogger, J. Speech patterns and racial wage inequality. J. Hum. Resour. 46 , 1–25 (2011).

Katz, D. & Braly, K. Racial stereotypes of one hundred college students. J. Abnorm. Soc. Psychol. 28 , 280–290 (1933).

Gilbert, G. M. Stereotype persistence and change among college students. J. Abnorm. Soc. Psychol. 46 , 245–254 (1951).

Karlins, M., Coffman, T. L. & Walters, G. On the fading of social stereotypes: studies in three generations of college students. J. Pers. Soc. Psychol. 13 , 1–16 (1969).

Devine, P. G. & Elliot, A. J. Are racial stereotypes really fading? The Princeton Trilogy revisited. Pers. Soc. Psychol. Bull. 21 , 1139–1150 (1995).

Madon, S. et al. Ethnic and national stereotypes: the Princeton Trilogy revisited and revised. Pers. Soc. Psychol. Bull. 27 , 996–1010 (2001).

Bergsieker, H. B., Leslie, L. M., Constantine, V. S. & Fiske, S. T. Stereotyping by omission: eliminate the negative, accentuate the positive. J. Pers. Soc. Psychol. 102 , 1214–1238 (2012).

Ghavami, N. & Peplau, L. A. An intersectional analysis of gender and ethnic stereotypes: testing three hypotheses. Psychol. Women Q. 37 , 113–127 (2013).

Lambert, W. E., Hodgson, R. C., Gardner, R. C. & Fillenbaum, S. Evaluational reactions to spoken languages. J. Abnorm. Soc. Psychol. 60 , 44–51 (1960).

Ball, P. Stereotypes of Anglo-Saxon and non-Anglo-Saxon accents: some exploratory Australian studies with the matched guise technique. Lang. Sci. 5 , 163–183 (1983).

Thomas, E. R. & Reaser, J. Delimiting perceptual cues used for the ethnic labeling of African American and European American voices. J. Socioling. 8 , 54–87 (2004).

Atkins, C. P. Do employment recruiters discriminate on the basis of nonstandard dialect? J. Employ. Couns. 30 , 108–118 (1993).

Payne, K., Downing, J. & Fleming, J. C. Speaking Ebonics in a professional context: the role of ethos/source credibility and perceived sociability of the speaker. J. Tech. Writ. Commun. 30 , 367–383 (2000).

Rodriguez, J. I., Cargile, A. C. & Rich, M. D. Reactions to African-American vernacular English: do more phonological features matter? West. J. Black Stud. 28 , 407–414 (2004).

Billings, A. C. Beyond the Ebonics debate: attitudes about Black and standard American English. J. Black Stud. 36 , 68–81 (2005).

Kurinec, C. A. & Weaver, C. III “Sounding Black”: speech stereotypicality activates racial stereotypes and expectations about appearance. Front. Psychol. 12 , 785283 (2021).

Rosa, J. & Flores, N. Unsettling race and language: toward a raciolinguistic perspective. Lang. Soc. 46 , 621–647 (2017).

Salehi, B., Hovy, D., Hovy, E. & Søgaard, A. Huntsville, hospitals, and hockey teams: names can reveal your location. In Proc. 3rd Workshop on Noisy User-generated Text (eds Derczynski, L. et al.) 116–121 (Association for Computational Linguistics, 2017).

Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).

Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019).

Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21 , 1–67 (2020).

Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. 36th Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 27730–27744 (NeurIPS, 2022).

OpenAI et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Zhang, E. & Zhang, Y. Average precision. In Encyclopedia of Database Systems (eds Liu, L. & Özsu, M. T.) 192–193 (Springer, 2009).

Black, J. S. & van Esch, P. AI-enabled recruiting: what is it and how should a manager use it? Bus. Horiz. 63 , 215–226 (2020).

Hunkenschroer, A. L. & Luetge, C. Ethics of AI-enabled recruiting and selection: a review and research agenda. J. Bus. Ethics 178 , 977–1007 (2022).

Upadhyay, A. K. & Khandelwal, K. Applying artificial intelligence: implications for recruitment. Strateg. HR Rev. 17 , 255–258 (2018).

Tippins, N. T., Oswald, F. L. & McPhail, S. M. Scientific, legal, and ethical concerns about AI-based personnel selection tools: a call to action. Pers. Assess. Decis. 7 , 1 (2021).

Aletras, N., Tsarapatsanis, D., Preoţiuc-Pietro, D. & Lampos, V. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Comput. Sci. 2 , e93 (2016).

Surden, H. Artificial intelligence and law: an overview. Ga State Univ. Law Rev. 35 , 1305–1337 (2019).

Medvedeva, M., Vols, M. & Wieling, M. Using machine learning to predict decisions of the European Court of Human Rights. Artif. Intell. Law 28 , 237–266 (2020).

Weidinger, L. et al. Taxonomy of risks posed by language models. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 214–229 (Association for Computing Machinery, 2022).

Czopp, A. M. & Monteith, M. J. Thinking well of African Americans: measuring complimentary stereotypes and negative prejudice. Basic Appl. Soc. Psychol. 28 , 233–250 (2006).

Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24 , 11324–11436 (2023).

Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://arxiv.org/abs/2204.05862 (2022).

Brown, T. B. et al. Language models are few-shot learners. In  Proc. 34th International Conference on Neural Information Processing Systems  (eds Larochelle, H. et al.) 1877–1901 (NeurIPS, 2020).

Dovidio, J. F. & Gaertner, S. L. Aversive racism. Adv. Exp. Soc. Psychol. 36 , 1–52 (2004).

Schuman, H., Steeh, C., Bobo, L. D. & Krysan, M. (eds) Racial Attitudes in America: Trends and Interpretations (Harvard Univ. Press, 1998).

Crosby, F., Bromley, S. & Saxe, L. Recent unobtrusive studies of Black and White discrimination and prejudice: a literature review. Psychol. Bull. 87 , 546–563 (1980).

Terkel, S. Race: How Blacks and Whites Think and Feel about the American Obsession (New Press, 1992).

Jackman, M. R. & Muha, M. J. Education and intergroup attitudes: moral enlightenment, superficial democratic commitment, or ideological refinement? Am. Sociol. Rev. 49 , 751–769 (1984).

Bonilla-Silva, E. The New Racism: Racial Structure in the United States, 1960s–1990s. In Race, Ethnicity, and Nationality in the United States: Toward the Twenty-First Century 1st edn (ed. Wong, P.) Ch. 4 (Westview Press, 1999).

Gao, L. et al. The Pile: an 800GB dataset of diverse text for language modeling. Preprint at https://arxiv.org/abs/2101.00027 (2021).

Ronkin, M. & Karn, H. E. Mock Ebonics: linguistic racism in parodies of Ebonics on the internet. J. Socioling. 3 , 360–380 (1999).

Dodge, J. et al. Documenting large webtext corpora: a case study on the Colossal Clean Crawled Corpus. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing (eds Moens, M.-F. et al.) 1286–1305 (Association for Computational Linguistics, 2021).

Steed, R., Panda, S., Kobren, A. & Wick, M. Upstream mitigation is not all you need: testing the bias transfer hypothesis in pre-trained language models. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (eds Muresan, S. et al.) 3524–3542 (Association for Computational Linguistics, 2022).

Feng, S., Park, C. Y., Liu, Y. & Tsvetkov, Y. From pretraining data to language models to downstream tasks: tracking the trails of political biases leading to unfair NLP models. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (eds Rogers, A. et al.) 11737–11762 (Association for Computational Linguistics, 2023).

Köksal, A. et al. Language-agnostic bias detection in language models with bias probing. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 12735–12747 (Association for Computational Linguistics, 2023).

Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl Acad. Sci. USA 115 , E3635–E3644 (2018).

Ferrer, X., van Nuenen, T., Such, J. M. & Criado, N. Discovering and categorising language biases in Reddit. In Proc. Fifteenth International AAAI Conference on Web and Social Media (eds Budak, C. et al.) 140–151 (Association for the Advancement of Artificial Intelligence, 2021).

Ethayarajh, K., Choi, Y. & Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 5988–6008 (Proceedings of Machine Learning Research, 2022).

Hoffmann, J. et al. Training compute-optimal large language models. Preprint at https://arxiv.org/abs/2203.15556 (2022).

Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research https://openreview.net/forum?id=iO4LZibEqW (2023).

Blodgett, S. L., Barocas, S., Daumé III, H. & Wallach, H. Language (technology) is power: A critical survey of “bias” in NLP. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 5454–5476 (Association for Computational Linguistics, 2020).

Jørgensen, A., Hovy, D. & Søgaard, A. Challenges of studying and processing dialects in social media. In Proc. Workshop on Noisy User-generated Text (eds Xu, W. et al.) 9–18 (Association for Computational Linguistics, 2015).

Blodgett, S. L., Green, L. & O’Connor, B. Demographic dialectal variation in social media: a case study of African-American English. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1119–1130 (Association for Computational Linguistics, 2016).


Acknowledgements

V.H. was funded by the German Academic Scholarship Foundation. P.R.K. was funded in part by the Open Phil AI Fellowship. This work was also funded by the Hoffman-Yee Research Grants programme and the Stanford Institute for Human-Centered Artificial Intelligence. We thank A. Köksal, D. Hovy, K. Gligorić, M. Harrington, M. Casillas, M. Cheng and P. Röttger for feedback on an earlier version of the article.

Author information

Authors and Affiliations

Allen Institute for AI, Seattle, WA, USA

Valentin Hofmann

University of Oxford, Oxford, UK

LMU Munich, Munich, Germany

Stanford University, Stanford, CA, USA

Pratyusha Ria Kalluri & Dan Jurafsky

The University of Chicago, Chicago, IL, USA

Sharese King


Contributions

V.H., P.R.K., D.J. and S.K. designed the research. V.H. performed the research and analysed the data. V.H., P.R.K., D.J. and S.K. wrote the paper.

Corresponding authors

Correspondence to Valentin Hofmann or Sharese King.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Rodney Coates and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1: Weighted average favourability of top stereotypes about African Americans in humans and top overt as well as covert stereotypes about African Americans in language models (LMs).

The overt stereotypes are more favourable than the reported human stereotypes, except for GPT2. The covert stereotypes are substantially less favourable than the least favourable reported human stereotypes from 1933. Results without weighting, which are very similar, are provided in Supplementary Fig. 6.
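To make the plotted quantity concrete, here is a minimal sketch, in Python, of how a weighted average favourability score could be computed. This is not the authors' code: it only assumes that each stereotype adjective carries a human favourability rating and a model-derived association weight, and the adjectives, ratings and weights below are invented for illustration.

```python
# Minimal sketch (not the authors' code): weighted average favourability of the
# top adjectives associated with a group. Each adjective is assumed to have a
# human favourability rating and a model-derived association weight; all
# numbers below are invented for illustration.

def weighted_favourability(adjectives, favourability, weights):
    """Return the weight-normalised mean favourability of the given adjectives."""
    total = sum(weights[a] for a in adjectives)
    return sum(favourability[a] * weights[a] for a in adjectives) / total

favourability = {"brilliant": 1.8, "calm": 1.2, "lazy": -1.5, "aggressive": -1.9}
weights = {"brilliant": 0.10, "calm": 0.25, "lazy": 0.40, "aggressive": 0.25}

top_adjectives = ["brilliant", "calm", "lazy", "aggressive"]
print(round(weighted_favourability(top_adjectives, favourability, weights), 3))  # -0.595
```

Because the weights are normalised, an unweighted mean is recovered by setting all weights equal, which is the comparison reported in the supplementary results.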

Extended Data Fig. 2 Prestige of occupations associated with AAE (positive values) versus SAE (negative values), for individual language models.

The shaded areas show 95% confidence bands around the regression lines. The association with AAE versus SAE is negatively correlated with occupational prestige, for all language models. We cannot conduct this analysis with GPT4 since the OpenAI API does not give access to the probabilities for all occupations.
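As a rough illustration of the kind of analysis this caption describes, and not the authors' code, the sketch below fits an ordinary least squares regression of a hypothetical AAE-versus-SAE association score on occupational prestige, reporting the slope with its 95% confidence interval and a 95% confidence band around the fitted line. The data are randomly generated, and numpy and statsmodels are assumed to be installed.

```python
# Minimal sketch (not the authors' code): linear regression of a hypothetical
# AAE-vs-SAE association score on occupational prestige, with a 95% confidence
# interval for the slope and a 95% confidence band around the fitted line.
# All data below are randomly generated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
prestige = rng.uniform(20, 80, size=100)                       # hypothetical prestige scores
association = -0.01 * prestige + rng.normal(0, 0.2, size=100)  # hypothetical association scores

X = sm.add_constant(prestige)        # design matrix with an intercept column
fit = sm.OLS(association, X).fit()   # ordinary least squares fit

slope = fit.params[1]
ci_low, ci_high = fit.conf_int(alpha=0.05)[1]      # 95% CI for the slope
band = fit.get_prediction(X).conf_int(alpha=0.05)  # 95% confidence band, one row per point

print(f"slope = {slope:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print("confidence band for the first observation:", band[0])
```

A negative slope whose confidence interval excludes zero would correspond to the pattern the caption reports: the more prestigious the occupation, the weaker its association with AAE relative to SAE.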

Supplementary information

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Hofmann, V., Kalluri, P.R., Jurafsky, D. et al. AI generates covertly racist decisions about people based on their dialect. Nature (2024). https://doi.org/10.1038/s41586-024-07856-5


Received: 08 February 2024

Accepted: 19 July 2024

Published: 28 August 2024

DOI: https://doi.org/10.1038/s41586-024-07856-5



Law Enforcement Is Experimenting With AI-Assisted Policing: What You Need to Know

AI can help make officers' jobs more efficient, but can also cause big legal headaches, experts say

Police departments across the country are starting to use artificial intelligence to help complete their paperwork. 

AI can help officers save time writing reports and make their jobs more efficient, the Associated Press reports.

But experts warn of possible dangers in tapping into the new technology, which can introduce serious, sometimes life-changing mistakes into reports, promote bias and racism, and leak private information, Politico reports.

According to its website, Draft One, a new AI product offered by the technology company Axon, is designed to help police officers file reports in less time than conventional methods require.

“Today, officers spend as much as two-thirds of their day on paperwork,” Axon says on the website. “Our AI Research team is working to cut the time spent on paperwork by improving the efficiency and accuracy of report-writing and information analysis in law enforcement."

For instance, Axon says its product gives officers the ability to automatically redact footage caught on body cams “so officers can share footage with the public faster while protecting the privacy of individuals captured in video."

Its AI software allows supervisors to review footage and reports so they “can better understand compliance and provide feedback in order to improve training and police-community relations.”

Draft One has had the “most positive reaction” of any product the company has debuted on the market, Axon’s founder and CEO Rick Smith told the Associated Press. But, he added, “there’s certainly concerns.”

District attorneys want to know that police officers didn’t only rely on an AI chatbot to write a report since they might have to testify in court about an alleged crime, Smith told the AP.

“They never want to get an officer on the stand who says, ‘Well, the AI wrote that, I didn’t,’” Smith told the outlet.

An AP investigation found "serious flaws" in another crime-fighting AI program, ShotSpotter, a gunshot detection tool that, according to its website, uses "sensors, algorithms and artificial intelligence" to classify the 14 million sounds in its database as gunshots or something else.

In one case in which ShotSpotter evidence was presented in court, an Illinois grandfather named Michael Williams was jailed for more than a year in 2021 after being accused of shooting and killing a man, based on audio evidence prosecutors say they obtained from the AI tool, the AP reported in 2022.

The AI technology allegedly picked up a loud noise when Williams’ car drove through an intersection and determined that Williams had fired a shot at the passenger in his car, the AP reported.

But Williams told authorities that a person in another vehicle pulled up to his car and shot at his passenger, the AP reported. A judge ended up dismissing the case when prosecutors said they didn’t have enough evidence to proceed.

“I kept trying to figure out, how can they get away with using the technology like that against me?” Williams told the AP.


In a variety of other applications used by lawyers, “AI creates ‘hallucinations’ or fake cases, citations, and legal arguments that seem correct but do not actually exist,” an expert at the law firm Freeman, Mathis & Gary wrote in an article on its website about risks and problems associated with Generative AI used in the legal profession.

These concerns extend to the accuracy of chatbots now being used by prosecutors and defense attorneys to sift through documents for relevant evidence, draft legal memoranda and devise complex litigation strategies. “A [2024] Stanford University study found that 75% of the answers generated by AI chatbots regarding a sample court ruling were incorrect,” the expert wrote.

“Further, AI cannot adequately address questions of law that implicate more than one practice area. For example, a legal issue implicating both immigration law and criminal law may yield an accurate answer for immigration law purposes but disregard any criminal law issues and implications."

AI used by law enforcement may be helpful but precautions and protections need to be in place, Jonathan Parham, a former police director in Rahway, New Jersey, told Politico.

"The AI should never replace the officer — it should enhance their operational competency,” Parham told Politico.



Computer Science > Computer Vision and Pattern Recognition

Title: Towards Deconfounded Image-Text Matching with Causal Inference

Abstract: Prior image-text matching methods have shown remarkable performance on many benchmark datasets, but most of them overlook the bias in the data, which exists both within and across modalities, and tend to learn spurious correlations that severely degrade the generalization ability of the model. Furthermore, these methods often incorporate biased external knowledge from large-scale datasets as prior knowledge into the image-text matching model, which inevitably forces the model to learn further biased associations. To address these limitations, this paper first uses Structural Causal Models (SCMs) to illustrate how intra- and inter-modal confounders damage image-text matching. Then, we employ backdoor adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN) for the image-text matching task. DCIN (1) decomposes the intra- and inter-modal confounders and incorporates them into the encoding stage of visual and textual features, effectively eliminating the spurious correlations during image-text matching, and (2) uses causal inference to mitigate biases of external knowledge. Consequently, the model can learn causality instead of spurious correlations caused by dataset bias. Extensive experiments on two well-known benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the superiority of our proposed method.
Comments: ACM MM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Journal reference: Proceedings of the 31st ACM International Conference on Multimedia, 6264–6273 (26 October 2023)
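For readers unfamiliar with the term used in the abstract above, backdoor adjustment is a standard identity from causal inference rather than something specific to this paper: if a set of variables Z blocks every backdoor path from X to Y, the effect of intervening on X can be estimated from observational data. Stated in generic notation (not the paper's exact model):

```latex
% Backdoor adjustment (generic form): if Z satisfies the backdoor
% criterion relative to (X, Y), the interventional distribution is
P\big(Y \mid \mathrm{do}(X = x)\big) = \sum_{z} P\big(Y \mid X = x, Z = z\big)\, P\big(Z = z\big)
```

In the set-up sketched in the abstract, the intra- and inter-modal confounders presumably play the role of Z: conditioning on them and averaging over their distribution is what removes the spurious, dataset-induced associations.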


