The average wage for African Americans in the data is 808.5, and for others the average wage is 1174. That means that African Americans earn (in this really specific data set) about 69% of what others earn, or 365.5 less.
Let’s say we take that fact to someone that doesn’t believe that African Americans are discriminated against. We’ll call them your “contrarian friend”; you can fill in your own idea of who that person might be. What will their response be? Probably that it isn’t evidence of discrimination, because of course African Americans earn less: they’re less likely to work in white collar jobs. And people that work in white collar jobs earn more, so that’s the reason African Americans earn less. It’s not discrimination, it’s just that they work different jobs.
And on the surface, they’d be right. African Americans are more likely to work in blue collar jobs (65% versus 50%), and blue collar jobs pay less (an average of 956 versus 1350 for white collar jobs).
| ethnicity | blue_collar |
|---|---|
| other | 0.5018 |
| afam | 0.6512 |
| blue_collar | wage |
|---|---|
| 0 | 1350 |
| 1 | 956.4 |
So what we’d want to do then is compare African Americans to others when both work blue collar jobs, and African Americans to others when both work white collar jobs. If there is a difference in wages between two people working the same job, that’s better evidence that the pay gap is a result not of their occupational choices but their race.
We can visualize that with a two by two chart.
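|  | other | afam |
|---|---|---|
| white collar | 1373 | 918 |
| blue collar | 977 | 749 |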
Let’s work across that chart to see what it tells us. A 2 by 2 chart like that is called a cross tab because it lets us tabulate figures across different characteristics of our data. They can be a methodologically simple way (we’re just showing means/averages there) to tell a story if the data is clear.
So what do we learn? Looking at the top row, white collar workers that are labeled other for ethnicity earn on average $1373. And white collar workers that are African American earn $918. Which means that for white collar workers, African Americans earn $455 less. For blue collar workers, other races earn $977, while African Americans earn $749. That’s a gap of $228. So the size of the gap is different depending on what a person’s job is, but African Americans earn less regardless of their job. So it isn’t just that African Americans are less likely to work white collar jobs that drives their lower wages. Even those in white collar jobs earn less. In fact, African Americans in white collar jobs earn less on average than other races working blue collar jobs!
This is what it means to hold something constant. In that table above we’re holding occupation constant, and comparing people based on their race to people of another race that work the same job. So differences in those jobs aren’t influencing our results now; we’ve set that effect aside for the moment.
And we can do that automatically with regression, like we did when we looked at the effect of computers on math scores, while holding the impact of school enrollment constant.
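Here is a minimal sketch of that regression in R, assuming the chapter’s data frame is called wages and uses the column names from the tables above (wage, ethnicity, blue_collar); the actual object and column names may differ.

```r
# Wage gap while holding occupation type constant.
# `wages` and the column names are placeholders for the chapter's data.
fit_occupation <- lm(wage ~ ethnicity + blue_collar, data = wages)
summary(fit_occupation)
```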
Based on those regression results, African Americans earn $309 less than other races when holding occupation constant, and that effect is highly significant. And blue collar workers earn $380 less than white collar workers when holding race constant, and that effect is significant too.
So have we proven discrimination in wages? Probably not yet for the contrarian friend. Without pause they’ll likely say that education is also important for wages, and African Americans are less likely to go to college. And in the data they’d be correct. On average African Americans completed 11.65 years of education, and other races completed 12.94.
| ethnicity | education |
|---|---|
| other | 12.94 |
| afam | 11.65 |
So let’s add that to our regression too.
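Using the same placeholder names as before, and assuming the years-of-education column is called education, the model just gains one more term:

```r
fit_education <- lm(wage ~ ethnicity + blue_collar + education, data = wages)
summary(fit_education)
```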
Now with the ethnicity variable we’re comparing people of different ethnicities that have the same occupation and education. And what do we find? Even holding both of those constant, we would expect an African American worker to earn $262 less, and that is highly significant.
What your contrarian friend is doing is proposing alternative variables and hypotheses that explain the gap in earnings for African Americans. And while those other things do make a difference, they don’t fully explain why African Americans earn less than others. We have shrunk the gap somewhat. Originally the gap was 365.5, which fell to 309 when we held occupation constant, and now to 262 with the inclusion of education. So those alternative explanations do explain a portion of why African Americans earned less: it was partly because they had lower-status jobs and less education (setting aside the fact that their lower-status jobs and less education may be the result of discrimination).
So what else do we want to include to try and explain that difference in wages? We can insert all of the variables in the data set to see if there is still a gap in wages between African Americans and others.
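A sketch of that full model; every column name here is a stand-in for whatever the variables are actually called in the data set:

```r
fit_full <- lm(wage ~ ethnicity + blue_collar + education + experience +
                 weeks_worked + industry + region + married + gender + union,
               data = wages)
summary(fit_full)
```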
Controlling for occupation, education, experience, weeks worked, the industry, the region of employment, whether they are married, their gender, and their union status, does ethnicity make a difference in earnings? Yes, if you found two workers that had the same values for all of those variables except that they were of different races, the African American would still likely earn less.
In our regression African Americans earn $167 less when holding occupation, education, experience, weeks worked, the industry, region, marriage, gender, and their union status constant, and that effect is still statistically significant.
The contrarian friend may still have another alternative hypothesis to attempt to explain away that result, but unfortunately that’s all the data will let us test.
What we’re attempting to do is minimize what is called missing variable bias. If there is a plausible story that explains our result, whether we’re predicting math test scores or wages or anything else, and we fail to account for that explanation, our model may be misleading. It was misleading to say that computers don’t increase math test scores when we didn’t control for the effect of larger school sizes.
What missing variables do we not have that may explain the difference in earnings between African Americans and others? We don’t know who is a manager at work or anything about job performance, and both of those should help explain why people earn more. So we haven’t removed our missing variable bias, and the evidence we can provide is limited by that. But based on the evidence we can generate, we find evidence of racial discrimination in wages.

And I should again emphasize: even if something else did explain the gap in earnings between African Americans and others, it wouldn’t prove there wasn’t discrimination in society. If differences in occupation did explain the racial gap in wages, that wouldn’t prove that discrimination didn’t push African Americans towards lower paying jobs.
But the work we’ve done above is similar to what a law firm would do if bringing a lawsuit against a large employer for wage discrimination. It’s hard to prove discrimination in individual cases. The employer will always just argue that John is a bad employee, and that’s why they earn less than their coworkers. Wage discrimination suits are typically brought as class action suits, where a large group of employees sues based on evidence that even when accounting for differences in specific job, and job performance, and experience, and other things there is still a gap in wages.
I should add a note about interpretation here. It’s the researcher that has to identify what the different coefficients mean in the real world. We can talk about discrimination because of differences in earnings for African Americans and others, but we wouldn’t say that blue collar workers are discriminated against because they earn less than white collar workers. It’s unlikely that someone would say that people with more experience earning more is the result of discrimination. These are interpretations that we layer on to the analysis based on our expectations and understanding of the research question.
Regression can be used to make predictions and learn more about the world in all sorts of contexts. Let’s work through another example, with a little more focus on the interpretation.
We’ll use a data set called Affairs, which unsurprisingly has data about affairs. Or more specifically, about people, and whether or not they have had an affair.
In the data set there are 10 variables.
So we can throw all of those variables into a regression and see which ones have the largest impact on the likelihood someone had an affair. But before that we should pause to make predictions. We shouldn’t include a variable just for laughs - we should have a reason for including it. We should be able to make a prediction for whether it should increase or decrease the dependent variable.
So what effect do you think these independent variables will have on the chances of someone having had an affair?
Those arguments may be wrong or right. And they certainly won’t be right in every case in the data - there will be counterexamples. What I’ve tried to do is lay out predictions, or hypotheses, for what I expect the model to show us. Let’s test them all and see what predicts whether someone had an affair.
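Here is a sketch of that regression using the version of the Affairs data shipped with the AER package; the chapter’s copy may use slightly different names, and turning the affair count into a 0/1 indicator is an assumption about how the outcome was coded.

```r
library(AER)
data("Affairs")

# Convert the count of affairs into a had-an-affair indicator (an assumption
# about how the chapter coded the outcome).
Affairs$had_affair <- as.numeric(Affairs$affairs > 0)

fit_affairs <- lm(had_affair ~ gender + age + yearsmarried + children +
                    religiousness + education + occupation + rating,
                  data = Affairs)
summary(fit_affairs)
```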
What do you see as the strongest predictors of whether someone had an affair? Let’s start by identifying what was highly statistically significant. Religiousness and rating both had p-values below .001, so we can be very confident that in the population people who are more religious and who report having happier marriages are both less likely to have affairs. Let’s interpret that more formally.
For each one unit increase in religiousness an individual’s chances of having an affair decrease by .05 holding their gender, age, years married, children, education, occupation and rating constant, and that change is significant.
That’s a long list of things we’re holding constant! When you get past 2 or 3 control variables, or when you’re describing different variables from the same model, you can use “holding all else constant” in place of the list.

For each one unit increase in the happiness rating of a marriage an individual’s chances of having an affair decrease by .09, holding all else constant, and that change is significant.
What else that we included in the model is useful for predicting whether someone had an affair?
Age and years married both reach statistical significance. As individuals get older, their chances of having an affair decrease, as I predicted.
However, as their marriages get longer the chances of having had an affair increase, not decrease as I thought. Interesting! Does that mean I should go back and change my prediction? No. What it likely means is that some of my assumptions were wrong, so I should update them and discuss why I was wrong (in the conclusion, if this were a paper). If we only used regression to find things that we already know, we wouldn’t learn anything new. It’s still good that I made a prediction though, because that highlights that the result is a little weird (to my eyes) or may be more surprising to readers. Imagine if you found that a new jobs program actually lowered participants’ incomes; that would be a really important outcome of your research and just as valuable as if you’d found that incomes increased.
A surprising finding could also be evidence that there’s something wrong in the data. Did we enter years of marriage correctly, or did we possibly reverse it, so that longer marriages are actually coded as lower numbers? That’d be odd in this case, but it’s always worth thinking that possibility through. If I got data that showed college graduates earned less than those without a high school degree I’d be very skeptical of the data, because that would go against everything we know. It might just be an odd, fluky one-time finding, or it could be evidence something is wrong in the data.
Okay, what about everything else? All the other variables are insignificant. Should we remove them from the analysis, since they don’t have a significant effect on our dependent variable? It depends. Insignificant variables can be worth including in most cases in order to show that they don’t have an effect on the outcome. It’s worth knowing that gender and children don’t have an effect on affairs in the population. We had a reason to think they would, and it turns out they don’t really have much of an influence on whether someone has sex outside their marriage. That’s good to know.
I didn’t have a prediction for education or occupation though, and the fact that they are insignificant means they aren’t really worth including. I’m not testing any interesting ideas about what affects affairs with those variables; they’re just being included because they’re in the data. That’s not a good reason for them to be there; we want to be testing something with each variable we include.
In truth, we haven’t done a lot of new work on code in this chapter. We’ve focused more on the big idea of what it means to go from bivariate regression to multivariate regression. So we won’t do a lot of practice, because the basic structure we learned in the last chapter drives most of what we’ll do.
We’ll read in some new data on Massachusetts schools and their test scores. It’s similar to the California Schools data, but from Massachusetts for variety.
We’ll focus on 4 of those variables, and try to figure out what predicts how schools do on tests in 8th grade (score8).
Let’s start by practicing writing a regression to look at the impact of spending (exptot) on test scores.
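A sketch, assuming the Massachusetts data has been read into a data frame called ma_schools (a placeholder name) with the columns named in the text:

```r
fit1 <- lm(score8 ~ exptot, data = ma_schools)
summary(fit1)
```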
That should look very similar to the last chapter. And we can interpret it the same way.
For each one unit increase in spending, we observe a .004 increase in test scores for 8th graders, and that change is significant.
Let’s add one more variable to the regression, and now include english along with exptot. To include an additional variable we just place a + sign between the two variables, as shown below.
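With the same placeholder data frame name:

```r
fit2 <- lm(score8 ~ exptot + english, data = ma_schools)
summary(fit2)
```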
Each one unit increase in spending is associated with a .007 increase in test scores for 8th graders, holding the percentage of non-native English speakers constant, and that change is significant.

Each one unit increase in the percentage of students that don’t speak English natively is associated with a 4.1 decrease in test scores for 8th graders, holding spending constant, and that change is significant.
And one more, let’s add one more variable: income.
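Again with the placeholder name ma_schools:

```r
fit3 <- lm(score8 ~ exptot + english + income, data = ma_schools)
summary(fit3)
```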
Interesting, spending actually lost its significance in that final regression and changed direction.

Each one unit increase in spending is associated with a .002 decrease in test scores for 8th graders when holding the percentage of non-native English speakers and parental income constant, but that change is insignificant.

Each one unit increase in the percentage of students that don’t speak English natively is associated with a 2.2 decrease in test scores for 8th graders when holding spending and parental income constant, and that change is significant.

Each one unit increase in parental income is associated with a 2.8 increase in test scores for 8th graders when holding spending and the percentage of non-native English speakers constant, and that change is significant.
The following video demonstrates the coding steps done above.
Learning Objectives
At the end of this section you should be able to answer the following questions:
Multiple Regression is a step beyond simple regression. The main difference between simple and multiple regression is that multiple regression includes two or more independent variables – sometimes called predictor variables – in the model, rather than just one.
As such, the purpose of multiple regression is to determine the utility of a set of predictor variables for predicting an outcome, which is generally some important event or behaviour. This outcome can be designated as the outcome variable, the dependent variable, or the criterion variable. For example, you might hypothesise that the need to belong will predict motivations for Facebook use and that self-esteem and meaningful existence will uniquely predict motivations for Facebook use.
Before beginning your analysis, you should consider the following points:
PowerPoint: Venn Diagrams
Please click on the link labeled “Venn Diagrams” to work through an example.
In these Venn Diagrams, you can see why it is best for the predictors to be strongly correlated with the dependent variable but uncorrelated with the other independent variables. This reduces the amount of shared variance between the independent variables. The illustration in Slide 2 shows logical relationships between predictors, for two different possible regression models in separate Venn diagrams. On the left, you can see three partially correlated independent variables on a single dependent variable. The three partially correlated independent variables are physical health, mental health, and spiritual health, and the dependent variable is life satisfaction. On the right, you have three highly correlated independent variables (e.g., BMI, blood pressure, heart rate) on the dependent variable of life satisfaction. The model on the left would have some use in discovering the associations between those variables; however, the model on the right would not be useful, as all three of the independent variables are basically measuring the same thing and are mostly accounting for the same variability in the dependent variable.
There are two main types of regression with multiple independent variables:
We will now be exploring the single step multiple regression:
All predictors enter the regression equation at once. Each predictor is treated as if it had been entered into the regression model after all other predictors had been entered. Each predictor is then evaluated in terms of the variance (i.e., level of prediction) it shares with the dependent variable.
There are a number of assumptions that should be assessed before performing a multiple regression analysis:
For our example research question, we will be looking at the combined effect of three predictor variables – perceived life stress, gender, and age – on the outcome variable of physical illness.
PowerPoint: Standard Regression
Please open the output at the link labeled “Chapter Five – Standard Regression” to view the output.
Slide 1 contains the standard regression analysis output.
On Slide 2 you can see, in the red circle, that the test statistics are significant. The F-statistic examines the overall significance of the model, and shows whether your predictors, as a group, provide a better fit to the data than no predictor variables, which they do in this example.

The R² values are shown in the green circle. The R² value shows the total amount of variance in the criterion accounted for by the predictors, and the adjusted R² is the estimated value of R² in the population.

Moving on to the individual variable effects on Slide 3, you can see the significance of the contribution of individual predictors in light blue. The unstandardized slope, or B value, is shown in red; it represents the change in the outcome for a one-unit increase in the predictor (e.g., increasing perceived stress by 1 unit raises physical illness by .40). Finally, you can see the standardised slope values in green, which are also known as beta values. These values are standardised, ranging from 0 to ±1, similar to an r value.
We should also briefly discuss dummy variables:
A dummy variable is a variable that is used to represent categorical information about the participants in a study. This could include gender, location, race, age groups, and so on. Dummy variables are most often dichotomous (they only have two values). When performing a regression, interpretation is easier if the values of the dummy variable are set to 0 and 1, with 1 usually representing that a characteristic is present. For example, a question asking the participants "Do you have a driver's license?" with a forced choice response of yes or no.
In this example on Slide 3 and circled in red, the variable is gender with male = 0, and female = 1. A positive Beta (B) means an association with 1, whereas a negative beta means an association with 0. In this case, being female was associated with greater levels of physical illness.
Here is an example of how to write up the results of a standard multiple regression analysis:
In order to test the research question, a multiple regression was conducted, with age, gender (0 = male, 1 = female), and perceived life stress as the predictors, and levels of physical illness as the dependent variable. Overall, the results showed that the predictive model was significant, F(3, 363) = 39.61, R² = .25, p < .001. Together the predictors explained a substantial amount of the variance in physical illness (25%). The results showed that perceived stress and gender were significant positive predictors of physical illness (β = .47, t = 9.96, p < .001, and β = .15, t = 3.23, p = .001, respectively). The results showed that age (β = −.02, t = −0.49, p = .63) was not a significant predictor of physical illness.
Statistics for Research Students Copyright © 2022 by University of Southern Queensland is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
Mediation analysis investigates whether a variable (i.e., mediator) changes in regard to an independent variable, in turn, affecting a dependent variable. Moderation analysis, on the other hand, investigates whether the statistical interaction between independent variables predict a dependent variable. Although this difference between these two types of analysis is explicit in current literature, there is still confusion with regard to the mediating and moderating effects of different variables on depression. The purpose of this study was to assess the mediating and moderating effects of anxiety, stress, positive affect, and negative affect on depression.
Two hundred and two university students (males = 93, females = 113) completed questionnaires assessing anxiety, stress, self-esteem, positive and negative affect, and depression. Mediation and moderation analyses were conducted using techniques based on standard multiple regression and hierarchical regression analyses.
The results indicated that (i) anxiety partially mediated the effects of both stress and self-esteem upon depression, (ii) that stress partially mediated the effects of anxiety and positive affect upon depression, (iii) that stress completely mediated the effects of self-esteem on depression, and (iv) that there was a significant interaction between stress and negative affect, and between positive affect and negative affect upon depression.
The study highlights different research questions that can be investigated depending on whether researchers decide to use the same variables as mediators and/or moderators.
Citation: Nima AA, Rosenberg P, Archer T, Garcia D (2013) Anxiety, Affect, Self-Esteem, and Stress: Mediation and Moderation Effects on Depression. PLoS ONE 8(9): e73265. https://doi.org/10.1371/journal.pone.0073265
Editor: Ben J. Harrison, The University of Melbourne, Australia
Received: February 21, 2013; Accepted: July 22, 2013; Published: September 9, 2013
Copyright: © 2013 Nima et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors have no support or funding to report.
Competing interests: The authors have declared that no competing interests exist.
Mediation refers to the covariance relationships among three variables: an independent variable (1), an assumed mediating variable (2), and a dependent variable (3). Mediation analysis investigates whether the mediating variable accounts for a significant amount of the shared variance between the independent and the dependent variables–the mediator changes in regard to the independent variable, in turn, affecting the dependent one [1] , [2] . On the other hand, moderation refers to the examination of the statistical interaction between independent variables in predicting a dependent variable [1] , [3] . In contrast to the mediator, the moderator is not expected to be correlated with both the independent and the dependent variable–Baron and Kenny [1] actually recommend that it is best if the moderator is not correlated with the independent variable and if the moderator is relatively stable, like a demographic variable (e.g., gender, socio-economic status) or a personality trait (e.g., affectivity).
Although both types of analysis lead to different conclusions [3] and the distinction between statistical procedures is part of the current literature [2] , there is still confusion about the use of moderation and mediation analyses using data pertaining to the prediction of depression. There are, for example, contradictions among studies that investigate mediating and moderating effects of anxiety, stress, self-esteem, and affect on depression. Depression, anxiety and stress are suggested to influence individuals' social relations and activities, work, and studies, as well as compromising decision-making and coping strategies [4] , [5] , [6] . Successfully coping with anxiety, depressiveness, and stressful situations may contribute to high levels of self-esteem and self-confidence, in addition increasing well-being, and psychological and physical health [6] . Thus, it is important to disentangle how these variables are related to each other. However, while some researchers perform mediation analysis with some of the variables mentioned here, other researchers conduct moderation analysis with the same variables. Seldom are both moderation and mediation performed on the same dataset. Before disentangling mediation and moderation effects on depression in the current literature, we briefly present the methodology behind the analysis performed in this study.
Baron and Kenny [1] postulated several criteria for the analysis of a mediating effect: a significant correlation between the independent and the dependent variable, the independent variable must be significantly associated with the mediator, the mediator predicts the dependent variable even when the independent variable is controlled for, and the correlation between the independent and the dependent variable must be eliminated or reduced when the mediator is controlled for. All the criteria are then tested using the Sobel test, which shows whether indirect effects are significant or not [1] , [7] . A complete mediating effect occurs when the correlation between the independent and the dependent variable is eliminated when the mediator is controlled for [8] . Analyses of mediation can, for example, help researchers to move beyond answering if high levels of stress lead to high levels of depression. With mediation analysis researchers might instead answer how stress is related to depression.
In contrast to mediation, moderation investigates the unique conditions under which two variables are related [3] . The third variable here, the moderator, is not an intermediate variable in the causal sequence from the independent to the dependent variable. For the analysis of moderation effects, the relation between the independent and dependent variable must be different at different levels of the moderator [3] . Moderators are included in the statistical analysis as an interaction term [1] . When analyzing moderating effects the variables should first be centered (i.e., rescaled so that the mean becomes 0 and the standard deviation becomes 1) in order to avoid problems with multicollinearity [8] . Moderating effects can be calculated using multiple hierarchical linear regressions whereby main effects are presented in the first step and interactions in the second step [1] . Analysis of moderation, for example, helps researchers to answer when or under which conditions stress is related to depression.
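As a rough illustration of that procedure (this is not code from the study), a moderation model with a standardized interaction between, say, stress and negative affect could be fit in R along these lines, assuming a data frame d with columns depression, stress, and neg_affect:

```r
# Standardize the predictors (mean 0, standard deviation 1).
d$stress_z <- as.numeric(scale(d$stress))
d$negaff_z <- as.numeric(scale(d$neg_affect))

# Step 1: main effects only.
step1 <- lm(depression ~ stress_z + negaff_z, data = d)
# Step 2: add the interaction term, i.e. the moderation effect.
step2 <- lm(depression ~ stress_z * negaff_z, data = d)

summary(step2)       # look at the stress_z:negaff_z coefficient
anova(step1, step2)  # does adding the interaction improve the fit?
```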
Cognitive vulnerability models suggest that maladaptive self-schema mirroring helplessness and low self-esteem explain the development and maintenance of depression (for a review see [9] ). These cognitive vulnerability factors become activated by negative life events or negative moods [10] and are suggested to interact with environmental stressors to increase risk for depression and other emotional disorders [11] , [10] . In this line of thinking, the experience of stress, low self-esteem, and negative emotions can cause depression, but also be used to explain how (i.e., mediation) and under which conditions (i.e., moderation) specific variables influence depression.
Using mediational analyses to investigate how cognitive therapy interventions reduced depression, researchers have shown that the intervention reduced anxiety, which in turn was responsible for 91% of the reduction in depression [12] . In the same study, reductions in depression, by the intervention, accounted only for 6% of the reduction in anxiety. Thus, anxiety seems to affect depression more than depression affects anxiety and, together with stress, is both a cause of and a powerful mediator influencing depression (See also [13] ). Indeed, there are positive relationships between depression, anxiety and stress in different cultures [14] . Moreover, while some studies show that stress (independent variable) increases anxiety (mediator), which in turn increased depression (dependent variable) [14] , other studies show that stress (moderator) interacts with maladaptive self-schemata (dependent variable) to increase depression (independent variable) [15] , [16] .
In order to illustrate how mediation and moderation can be used to address different research questions we first focus our attention on anxiety and stress as mediators of different variables that earlier have been shown to be related to depression. Secondly, we use all variables to find which of these variables moderate the effects on depression.
The specific aims of the present study were:
This research protocol was approved by the Ethics Committee of the University of Gothenburg and written informed consent was obtained from all the study participants.
The present study was based upon a sample of 206 participants (males = 93, females = 113). All the participants were first year students in different disciplines at two universities in South Sweden. The mean age for the male students was 25.93 years ( SD = 6.66), and 25.30 years ( SD = 5.83) for the female students.
In total, 206 questionnaires were distributed to the students. Together 202 questionnaires were responded to leaving a total dropout of 1.94%. This dropout concerned three sections that the participants chose not to respond to at all, and one section that was completed incorrectly. None of these four questionnaires was included in the analyses.
Hospital Anxiety and Depression Scale [17].

The Swedish translation of this instrument [18] was used to measure anxiety and depression. The instrument consists of 14 statements (7 of which measure depression and 7 measure anxiety) to which participants are asked to indicate their grade of agreement on a Likert scale (0 to 3). The utility, reliability and validity of the instrument has been shown in multiple studies (e.g., [19] ).

The Swedish version [21] of this instrument was used to measure individuals' experience of stress. The instrument consists of 14 statements which participants rate on a Likert scale (0 = never, 4 = very often). High values indicate that the individual expresses a high degree of stress.
The Rosenberg's Self-Esteem Scale (Swedish version by Lindwall [23] ) consists of 10 statements focusing on general feelings toward the self. Participants are asked to report grade of agreement in a four-point Likert scale (1 = agree not at all, 4 = agree completely ). This is the most widely used instrument for estimation of self-esteem with high levels of reliability and validity (e.g., [24] , [25] ).
This is a widely applied instrument for measuring individuals' self-reported mood and feelings. The Swedish version has been used among participants of different ages and occupations (e.g., [27] , [28] , [29] ). The instrument consists of 20 adjectives, 10 positive affect (e.g., proud, strong) and 10 negative affect (e.g., afraid, irritable). The adjectives are rated on a five-point Likert scale (1 = not at all , 5 = very much ). The instrument is a reliable, valid, and effective self-report instrument for estimating these two important and independent aspects of mood [26] .
Questionnaires were distributed to the participants on several different locations within the university, including the library and lecture halls. Participants were asked to complete the questionnaire after being informed about the purpose and duration (10–15 minutes) of the study. Participants were also ensured complete anonymity and informed that they could end their participation whenever they liked.
Depression showed positive, significant relationships with anxiety, stress and negative affect. Table 1 presents the correlation coefficients, mean values and standard deviations (sd), as well as Cronbach's α for all the variables in the study.
(Table 1: https://doi.org/10.1371/journal.pone.0073265.t001)
Regression analyses were performed in order to investigate if anxiety mediated the effect of stress, self-esteem, and affect on depression (aim 1). The first regression showed that stress ( B = .03, 95% CI [.02,.05], β = .36, t = 4.32, p <.001), self-esteem ( B = −.03, 95% CI [−.05, −.01], β = −.24, t = −3.20, p <.001), and positive affect ( B = −.02, 95% CI [−.05, −.01], β = −.19, t = −2.93, p = .004) each had a unique effect on depression. Surprisingly, negative affect did not predict depression ( p = 0.77) and was therefore removed from the mediation model, thus not included in further analysis.
The second regression tested whether stress, self-esteem and positive affect uniquely predicted the mediator (i.e., anxiety). Stress was found to be positively associated ( B = .21, 95% CI [.15,.27], β = .47, t = 7.35, p <.001), whereas self-esteem was negatively associated ( B = −.29, 95% CI [−.38, −.21], β = −.42, t = −6.48, p <.001) to anxiety. Positive affect, however, was not associated to anxiety ( p = .50) and was therefore removed from further analysis.
A hierarchical regression analysis using depression as the outcome variable was performed using stress and self-esteem as predictors in the first step, and anxiety as predictor in the second step. This analysis allows the examination of whether stress and self-esteem predict depression and if this relation is weakened in the presence of anxiety as the mediator. The result indicated that, in the first step, both stress ( B = .04, 95% CI [.03,.05], β = .45, t = 6.43, p <.001) and self-esteem ( B = .04, 95% CI [.03,.05], β = .45, t = 6.43, p <.001) predicted depression. When anxiety (i.e., the mediator) was controlled for predictability was reduced somewhat but was still significant for stress ( B = .03, 95% CI [.02,.04], β = .33, t = 4.29, p <.001) and for self-esteem ( B = −.03, 95% CI [−.05, −.01], β = −.20, t = −2.62, p = .009). Anxiety, as a mediator, predicted depression even when both stress and self-esteem were controlled for ( B = .05, 95% CI [.02,.08], β = .26, t = 3.17, p = .002). Anxiety improved the prediction of depression over-and-above the independent variables (i.e., stress and self-esteem) (ΔR² = .03, F (1, 198) = 10.06, p = .002). See Table 2 for the details.
(Table 2: https://doi.org/10.1371/journal.pone.0073265.t002)
A Sobel test was conducted to test the mediating criteria and to assess whether indirect effects were significant or not. The result showed that the complete pathway from stress (independent variable) to anxiety (mediator) to depression (dependent variable) was significant ( z = 2.89, p = .003). The complete pathway from self-esteem (independent variable) to anxiety (mediator) to depression (dependent variable) was also significant ( z = 2.82, p = .004). This indicates that anxiety partially mediates the effects of both stress and self-esteem on depression. This result may also indicate that both stress and self-esteem contribute directly to explain the variation in depression and indirectly via experienced level of anxiety (see Figure 1 ).
Figure 1. Changes in Beta weights when the mediator is present are highlighted in red. (https://doi.org/10.1371/journal.pone.0073265.g001)
For the second aim, regression analyses were performed in order to test if stress mediated the effect of anxiety, self-esteem, and affect on depression. The first regression showed that anxiety ( B = .07, 95% CI [.04,.10], β = .37, t = 4.57, p <.001), self-esteem ( B = −.02, 95% CI [−.05, −.01], β = −.18, t = −2.23, p = .03), and positive affect ( B = −.03, 95% CI [−.04, −.02], β = −.27, t = −4.35, p <.001) predicted depression independently of each other. Negative affect did not predict depression ( p = 0.74) and was therefore removed from further analysis.
The second regression investigated if anxiety, self-esteem and positive affect uniquely predicted the mediator (i.e., stress). Stress was positively associated to anxiety ( B = 1.01, 95% CI [.75, 1.30], β = .46, t = 7.35, p <.001), negatively associated to self-esteem ( B = −.30, 95% CI [−.50, −.01], β = −.19, t = −2.90, p = .004), and negatively associated to positive affect ( B = −.33, 95% CI [−.46, −.20], β = −.27, t = −5.02, p <.001).
A hierarchical regression analysis using depression as the outcome and anxiety, self-esteem, and positive affect as the predictors in the first step, and stress as the predictor in the second step, allowed the examination of whether anxiety, self-esteem and positive affect predicted depression and if this association would weaken when stress (i.e., the mediator) was present. In the first step of the regression anxiety ( B = .07, 95% CI [.05,.10], β = .38, t = 5.31, p = .02), self-esteem ( B = −.03, 95% CI [−.05, −.01], β = −.18, t = −2.41, p = .02), and positive affect ( B = −.03, 95% CI [−.04, −.02], β = −.27, t = −4.36, p <.001) significantly explained depression. When stress (i.e., the mediator) was controlled for, predictability was reduced somewhat but was still significant for anxiety ( B = .05, 95% CI [.02,.08], β = .05, t = 4.29, p <.001) and for positive affect ( B = −.02, 95% CI [−.04, −.01], β = −.20, t = −3.16, p = .002), whereas self-esteem did not reach significance ( p = .08). In the second step, the mediator (i.e., stress) predicted depression even when anxiety, self-esteem, and positive affect were controlled for ( B = .02, 95% CI [.08,.04], β = .25, t = 3.07, p = .002). Stress improved the prediction of depression over-and-above the independent variables (i.e., anxiety, self-esteem and positive affect) (ΔR² = .02, F (1, 197) = 9.40, p = .002). See Table 3 for the details.
(Table 3: https://doi.org/10.1371/journal.pone.0073265.t003)
Furthermore, the Sobel test indicated that the complete pathways from the independent variables (anxiety: z = 2.81, p = .004; self-esteem: z = 2.05, p = .04; positive affect: z = 2.58, p <.01) to the mediator (i.e., stress), to the outcome (i.e., depression) were significant. These specific results might be explained on the basis that stress partially mediated the effects of both anxiety and positive affect on depression while stress completely mediated the effects of self-esteem on depression. In other words, anxiety and positive affect contributed directly to explain the variation in depression and indirectly via the experienced level of stress. Self-esteem contributed only indirectly via the experienced level of stress to explain the variation in depression. In other words, stress effects on depression originate from “its own power” and explained more of the variation in depression than self-esteem (see Figure 2 ).
(Figure 2: https://doi.org/10.1371/journal.pone.0073265.g002)
Multiple linear regression analyses were used in order to examine moderation effects between anxiety, stress, self-esteem and affect on depression. The analysis indicated that about 52% of the variation in the dependent variable (i.e., depression) could be explained by the main effects and the interaction effects ( R² = .55, adjusted R² = .51, F (55, 186) = 14.87, p <.001). When the variables (dependent and independent) were standardized, both the standardized regression coefficients beta (β) and the unstandardized regression coefficients beta (B) became the same value with regard to the main effects. Three of the main effects were significant and contributed uniquely to high levels of depression: anxiety ( B = .26, t = 3.12, p = .002), stress ( B = .25, t = 2.86, p = .005), and self-esteem ( B = −.17, t = −2.17, p = .03). The main effect of positive affect was also significant and contributed to low levels of depression ( B = −.16, t = −2.027, p = .02) (see Figure 3 ). Furthermore, the results indicated that two moderator effects were significant. These were the interaction between stress and negative affect ( B = −.28, β = −.39, t = −2.36, p = .02) (see Figure 4 ) and the interaction between positive affect and negative affect ( B = −.21, β = −.29, t = −2.30, p = .02) ( Figure 5 ).
(Figure 3: https://doi.org/10.1371/journal.pone.0073265.g003)
Figure 4. Low stress and low negative affect lead to lower levels of depression compared to high stress and high negative affect. (https://doi.org/10.1371/journal.pone.0073265.g004)
Figure 5. High positive affect and low negative affect lead to lower levels of depression compared to low positive affect and high negative affect. (https://doi.org/10.1371/journal.pone.0073265.g005)
The results in the present study show that (i) anxiety partially mediated the effects of both stress and self-esteem on depression, (ii) that stress partially mediated the effects of anxiety and positive affect on depression, (iii) that stress completely mediated the effects of self-esteem on depression, and (iv) that there was a significant interaction between stress and negative affect, and positive affect and negative affect on depression.
The study suggests that anxiety contributes directly to explaining the variance in depression while stress and self-esteem might contribute directly to explaining the variance in depression and indirectly by increasing feelings of anxiety. Indeed, individuals who experience stress over a long period of time are susceptible to increased anxiety and depression [30] , [31] and previous research shows that high self-esteem seems to buffer against anxiety and depression [32] , [33] . The study also showed that stress partially mediated the effects of both anxiety and positive affect on depression and that stress completely mediated the effects of self-esteem on depression. Anxiety and positive affect contributed directly to explain the variation in depression and indirectly to the experienced level of stress. Self-esteem contributed only indirectly via the experienced level of stress to explain the variation in depression, i.e. stress affects depression on the basis of ‘its own power’ and explains much more of the variation in depressive experiences than self-esteem. In general, individuals who experience low anxiety and frequently experience positive affect seem to experience low stress, which might reduce their levels of depression. Academic stress, for instance, may increase the risk for experiencing depression among students [34] . Although self-esteem did not emerge as an important variable here, under circumstances in which difficulties in life become chronic, some researchers suggest that low self-esteem facilitates the experience of stress [35] .
The present study showed that the interaction between stress and negative affect and between positive and negative affect influenced self-reported depression symptoms. Moderation effects between stress and negative affect imply that the students experiencing low levels of stress and low negative affect reported lower levels of depression than those who experience high levels of stress and high negative affect. This result confirms earlier findings that underline the strong positive association between negative affect and both stress and depression [36] , [37] . Nevertheless, negative affect by itself did not predict depression. In this regard, it is important to point out that the absence of positive emotions is a better predictor of morbidity than the presence of negative emotions [38] , [39] . A modification to this statement, as illustrated by the results discussed next, could be that the presence of negative emotions in conjunction with the absence of positive emotions increases morbidity.
The moderating effects between positive and negative affect on the experience of depression imply that the students experiencing high levels of positive affect and low levels of negative affect reported lower levels of depression than those who experience low levels of positive affect and high levels of negative affect. This result fits previous observations indicating that different combinations of these affect dimensions are related to different measures of physical and mental health and well-being, such as, blood pressure, depression, quality of sleep, anxiety, life satisfaction, psychological well-being, and self-regulation [40] – [51] .
The result indicated a relatively low mean value for depression ( M = 3.69), perhaps because the studied population was university students. This might limit the generalizability of the results and might also explain why negative affect, commonly associated with depression, was not related to depression in the present study. Moreover, there is a potential influence of single source/single method variance on the findings, especially given the high correlation between all the variables under examination.
The present study highlights the different results that can be arrived at depending on whether researchers decide to use variables as mediators or moderators. For example, when using mediational analyses, anxiety and stress seem to be important factors that explain how the different variables used here influence depression–increases in anxiety and stress by any other factor seem to lead to increases in depression. In contrast, when moderation analyses were used, the interaction of stress and affect predicted depression and the interaction of both affectivity dimensions (i.e., positive and negative affect) also predicted depression–stress might increase depression under the condition that the individual is high in negative affectivity, in turn, negative affectivity might increase depression under the condition that the individual experiences low positive affectivity.
The authors would like to thank the reviewers for their openness and suggestions, which significantly improved the article.
Conceived and designed the experiments: AAN TA. Performed the experiments: AAN. Analyzed the data: AAN DG. Contributed reagents/materials/analysis tools: AAN TA DG. Wrote the paper: AAN PR TA DG.
Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.
To get started, let's read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. These data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:
We'll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.
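A sketch of that step; the file path is a placeholder, and the column order is assumed to match the names used throughout the post:

```r
# Read the whitespace-delimited overdose data (the path is a placeholder).
ami_data <- read.table("ami_data.DAT")
names(ami_data) <- c("TOT", "AMI", "GEN", "AMT", "PR", "DIAP", "QRS")
```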
Before going further you may wish to explore the data using the summary() and pairs() functions.
Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and "binds" them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition but rather inclusion. Taken together the formula cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS translates to "model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS." To fit this model we use the workhorse lm() function and save it to an object we name "mlm1". Finally we view the results with summary() .
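Putting that together, using the ami_data name assumed above:

```r
mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)
```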
Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if we modeled each separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They're identical.
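A sketch of those separate single-response fits:

```r
summary(lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
summary(lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
```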
The same diagnostics we check for models with one predictor should be checked for these as well. For a review of some basic but essential diagnostics see our post Understanding Diagnostic Plots for Linear Regression Analysis .
We can use R's extractor functions with our mlm1 object, except we'll get double the output: instead of one set of residuals, one set of fitted values, one set of coefficients, and one residual standard error, we get two of each.
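A sketch of those extractor calls; the residual standard errors are computed directly here rather than through a dedicated extractor:

```r
head(resid(mlm1))    # residuals: one column for TOT, one for AMI
head(fitted(mlm1))   # fitted values: one column per response
coef(mlm1)           # coefficient matrix: one column per response
# Residual standard errors, one per response:
apply(resid(mlm1), 2, function(r) sqrt(sum(r^2) / df.residual(mlm1)))
```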
Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don't reproduce the output here because of the size, but we encourage you to view it for yourself:
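The variance-covariance matrix of all the coefficients comes from the usual extractor:

```r
vcov(mlm1)
```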
The main takeaway is that the coefficients from both models covary. That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it's not enough to eyeball the results from the two separate regressions. We should formally test for their inclusion. And that test involves the covariances between the coefficients in both models.
Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term "multivariate" here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.
The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:
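For example (the car package must be installed):

```r
library(car)
Anova(mlm1)
```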
The results are titled "Type II MANOVA Tests". The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. "Type II" refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.
Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function, (notice the little "a"; this is different from the Anova() function in the car package). For example, below we create a new model using the update() function that only includes GEN and AMT. The expression . ~ . - PR - DIAP - QRS says "keep the same responses and predictors except PR, DIAP and QRS."
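A sketch of that comparison:

```r
mlm2 <- update(mlm1, . ~ . - PR - DIAP - QRS)
anova(mlm1, mlm2)
```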
The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is "Pillai", which is one of the four common multivariate test statistics.
The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.
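A sketch of that call, saving the result as lh.out so its pieces can be reused below:

```r
lh.out <- linearHypothesis(mlm1, c("PR = 0", "DIAP = 0", "QRS = 0"))
lh.out
```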
The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).
Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is $$ \frac{\begin{vmatrix}\bf{E}\end{vmatrix}}{\begin{vmatrix}\bf{E} + \bf{H}\end{vmatrix}} $$
In R we can calculate that as follows:
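Pulling the two matrices out of lh.out:

```r
H <- lh.out$SSPH   # hypothesis sum of squares and products matrix
E <- lh.out$SSPE   # error sum of squares and products matrix
det(E) / det(E + H)   # Wilks lambda
```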
Likewise the formula for Pillai is $$ tr[\bf{H}(\bf{H} + \bf{E})^{-1}] $$ tr means trace. That's the sum of the diagonal elements of a matrix. In R we can calculate as follows:
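Using the same H and E objects:

```r
sum(diag(H %*% solve(H + E)))   # Pillai's trace
```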
The formula for Hotelling-Lawley is $$ tr[\bf{H}\bf{E}^{-1}] $$ In R:
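Again with H and E:

```r
sum(diag(H %*% solve(E)))   # Hotelling-Lawley trace
```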
And finally the Roy statistic is the largest eigenvalue of \(\bf{H}\bf{E}^{-1}\). In R code:
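With the same objects:

```r
max(eigen(H %*% solve(E))$values)   # Roy's largest root
```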
Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary:
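Since mlm2 is the reduced model created with update() above:

```r
summary(mlm2)
```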
Now let's say we wanted to use this model to estimate mean TOT and AMI values for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need to put our new data into a data frame with column names that match our original data.
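A sketch of that prediction (GEN = 1 codes female, per the text):

```r
new_obs <- data.frame(GEN = 1, AMT = 1200)
predict(mlm2, newdata = new_obs)
```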
This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don't calculate an interval but rather an ellipse to capture the uncertainty in two dimensions.
Unfortunately, at the time of this writing there doesn't appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However, below we provide one you can use, called confidenceEllipse(). The details of the function go beyond a "getting started" blog post, but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict(). Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics.
Here's a demonstration of the function.
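Assuming the confidenceEllipse() function from the post has been run in your console (so it exists in your session), a call for our prediction might look like this:

```r
confidenceEllipse(mlm2,
                  newdata = data.frame(GEN = 1, AMT = 1200),
                  level = 0.95)   # 95% confidence ellipse for the two predicted means
```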
The dot in the center is our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We're 95% confident the true mean values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa.
Clay Ford, Statistical Research Consultant, University of Virginia Library. October 27, 2017; updated May 26, 2023; updated February 20, 2024 (changed function name).
Introduction.
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).
For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.
Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.
Assumptions.
When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions: (1) the dependent variable is measured on a continuous scale; (2) there are two or more independent variables, which can be either continuous or categorical; (3) the observations (residuals) are independent; (4) there is a linear relationship between the dependent variable and each of your independent variables, as well as the independent variables collectively; (5) the data show homoscedasticity; (6) the data do not show multicollinearity; (7) there are no significant outliers, high leverage points or highly influential points; and (8) the residuals are approximately normally distributed.
You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.
In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a multiple regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.
A health researcher wants to be able to predict "VO 2 max", an indicator of fitness and health. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off those individuals who are not very active/fit and those individuals who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO 2 max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO 2 max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower workload cycling test. The researcher's goal is to be able to predict VO 2 max based on these four attributes: age, weight, heart rate and gender.
In SPSS Statistics, we created six variables: (1) VO 2 max , which is the maximal aerobic capacity; (2) age , which is the participant's age; (3) weight , which is the participant's weight (technically, it is their 'mass'); (4) heart_rate , which is the participant's heart rate; (5) gender , which is the participant's gender; and (6) caseno , which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics .
The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).
Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28 , as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version , SPSS Statistics introduced a new look to their interface called " SPSS Light ", replacing the previous look for versions 26 and earlier versions , which was called " SPSS Standard ". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical .
Published with written permission from SPSS Statistics, IBM Corporation.
Note: Don't worry that you're selecting Analyze > Regression > Linear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression. You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.
SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.
However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:
The first table of interest is the Model Summary table. This table provides the R , R 2 , adjusted R 2 , and the standard error of the estimate, which can be used to determine how well a regression model fits the data:
The " R " column represents the value of R , the multiple correlation coefficient . R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max . A value of 0.760, in this example, indicates a good level of prediction. The " R Square " column represents the R 2 value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO 2 max . However, you also need to be able to interpret " Adjusted R Square " ( adj. R 2 ) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.
The F -ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F (4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit of the data).
The general form of the equation to predict VO 2 max from age , weight , heart_rate , gender , is:
predicted VO 2 max = 87.83 – (0.165 x age ) – (0.385 x weight ) – (0.118 x heart_rate ) + (13.208 x gender )
This is obtained from the Coefficients table, as shown below:
Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B 1 , for age is equal to -0.165 (see Coefficients table). This means that for each one year increase in age, there is a decrease in VO 2 max of 0.165 ml/min/kg.
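To make the fitted equation concrete, here is a small R sketch (R is not part of this SPSS guide, and the input values below are hypothetical) that plugs one case into the equation above:

```r
# Hypothetical case: age 30, weight 70 kg, heart rate 130, gender coded 1
age <- 30; weight <- 70; heart_rate <- 130; gender <- 1
87.83 - (0.165 * age) - (0.385 * weight) - (0.118 * heart_rate) + (13.208 * gender)
# roughly 53.8 ml/min/kg
```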
You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t -value and corresponding p -value are located in the " t " and " Sig. " columns, respectively, as highlighted below:
You can see from the " Sig. " column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B 0 , is tested for statistical significance, this is rarely an important or interesting finding.
You could write up the results as follows:
A multiple regression was run to predict VO 2 max from gender, age, weight and heart rate. These variables statistically significantly predicted VO 2 max, F (4, 95) = 32.393, p < .0005, R 2 = .577. All four variables added statistically significantly to the prediction, p < .05.
If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon $$ where, for \(i = 1, \dots, n\) observations: \(y_i\) = the dependent variable; \(x_{i1}, \dots, x_{ip}\) = the explanatory variables; \(\beta_0\) = the y-intercept (constant term); \(\beta_1, \dots, \beta_p\) = the slope coefficients for each explanatory variable; and \(\epsilon\) = the model's error term (also known as the residuals).
Simple linear regression is a function that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Linear regression can only be used when one has two continuous variables—an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends to several explanatory variables.
The multiple regression model is based on the following assumptions: there is a linear relationship between the dependent variable and each of the independent variables; the independent variables are not too highly correlated with one another; the observations are selected independently and randomly from the population; and the residuals are normally distributed with a mean of zero and constant variance.
The coefficient of determination (R-squared) is a statistical metric that is used to measure how much of the variation in outcome can be explained by the variation in the independent variables. R 2 always increases as more predictors are added to the MLR model, even though the predictors may not be related to the outcome variable.
Thus R 2 by itself can't be used to identify which predictors should be included in a model and which should be excluded. R 2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.
When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation, or vertically in table form.
As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil (XOM). In this case, their linear equation will have the value of the S&P 500 index as the independent variable, or predictor, and the price of XOM as the dependent variable.
In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for example, depends on more than just the performance of the overall market. Other predictors such as the price of oil, interest rates, and the price movement of oil futures can affect the price of ExxonMobil ( XOM ) and the stock prices of other oil companies. To understand a relationship in which more than two variables are present, multiple linear regression is used.
Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables. In other terms, MLR examines how multiple independent variables are related to one dependent variable. Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction on the level of effect they have on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.
Referring to the MLR equation above, in our example the dependent variable \(y_i\) is the price of XOM, and the explanatory variables are the predictors discussed above: the performance of the overall market (the S&P 500 index), the price of oil, interest rates, and the price movement of oil futures.
The least-squares estimates (B 0 , B 1 , B 2 , ... B p ) are usually computed by statistical software. Any number of variables can be included in the regression model, with each independent variable indexed by a number: 1, 2, 3, ... p. The multiple regression model allows an analyst to predict an outcome based on information provided on multiple explanatory variables.
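As an illustration of what that software does, here is a minimal, hypothetical R sketch (simulated data, not the XOM example) that fits an MLR model and recovers the least-squares estimates:

```r
set.seed(1)
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 + 1.5 * df$x1 - 0.8 * df$x2 + 0.3 * df$x3 + rnorm(n)  # known "true" coefficients plus noise

fit <- lm(y ~ x1 + x2 + x3, data = df)
coef(fit)               # least-squares estimates of B0, B1, B2, B3
summary(fit)$r.squared  # proportion of variance in y explained by the predictors
```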
Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome predicted by the model. The residual value, E, which is the difference between the actual outcome and the predicted outcome, is included in the model to account for such slight variations.
Assume we run our XOM price regression model through statistical software, which returns estimated coefficients and fit statistics for these predictors.
An analyst would interpret this output to mean if other variables are held constant, the price of XOM will increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates. R 2 indicates that 86.5% of the variations in the stock price of Exxon Mobil can be explained by changes in the interest rate, oil price, oil futures, and S&P 500 index.
Ordinary least squares (OLS) regression compares the response of a dependent variable given a change in some explanatory variables. However, a dependent variable is rarely explained by only one variable. In this case, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. Multiple regressions can be linear and nonlinear.
Multiple regressions are based on the assumption that there is a linear relationship between both the dependent and independent variables. It also assumes no major correlation between the independent variables.
A multiple regression considers the effect of more than one explanatory variable on some outcome of interest. It evaluates the relative effect of these explanatory, or independent, variables on the dependent variable when holding all the other variables in the model constant.
A dependent variable is rarely explained by only one variable. In such cases, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. The model, however, assumes that there are no major correlations between the independent variables.
Running a multiple regression by hand is unlikely to be practical, as multiple regression models are complex and become even more so when there are more variables included in the model or when the amount of data to analyze grows. To run a multiple regression you will likely need to use specialized statistical software or functions within programs like Excel.
In multiple linear regression, the model calculates the line of best fit that minimizes the variances of each of the variables included as it relates to the dependent variable. Because it fits a line, it is a linear model. There are also non-linear regression models involving multiple variables, such as logistic regression, quadratic regression, and probit models.
Any econometric model that looks at more than one variable may be a multiple regression. Factor models compare two or more factors to analyze relationships between variables and the resulting performance. The Fama and French Three-Factor Model is one such model; it expands on the capital asset pricing model (CAPM) by adding size risk and value risk factors to the market risk factor in CAPM (which is itself a regression model). By including these two additional factors, the model adjusts for the tendency of small-cap and value stocks to outperform the market, which is thought to make it a better tool for evaluating manager performance.
We discuss regression analysis with multiple predictor variables, detailing especially the case with two predictor variables.
Toyoda, H. (2024). Multiple Regression Model. In: Statistics with Posterior Probability and a PHC Curve. Springer, Singapore. https://doi.org/10.1007/978-981-97-3094-0_15
This tutorial covers data checks and descriptive statistics, the SPSS regression dialogs, the SPSS multiple regression output, multiple regression assumptions, and APA reporting of multiple regression.
A scientist wants to know if and how health care costs can be predicted from several patient characteristics. All data are in health-costs.sav as shown below.
Our scientist thinks that each independent variable has a linear relation with health care costs. He therefore decides to fit a multiple linear regression model. The final model will predict costs from all independent variables simultaneously.
Before running multiple regression, first make sure that (1) the dependent variable is quantitative, (2) each predictor variable is either quantitative or dichotomous, and (3) you have a sufficient sample size.
A visual inspection of our data shows that requirements 1 and 2 are met: sex is a dichotomous variable and all other relevant variables are quantitative. Regarding sample size, a general rule of thumb is that you want to use at least 15 independent observations for each independent variable you'll include. In our example, we'll use 5 independent variables so we need a sample size of at least N = (5 · 15 =) 75 cases. Our data contain 525 cases so this seems fine.
Keep in mind, however, that we may not be able to use all N = 525 cases if there's any missing values in our variables.
Let's now proceed with some quick data checks. I strongly encourage you to at least run basic histograms over all variables to check for plausible values, inspect a table of descriptive statistics, and inspect the correlation matrix among all variables.
The APA recommends you combine and report these last two tables as shown below.
These data checks show that our example data look perfectly fine: all charts are plausible, there's no missing values and none of the correlations exceed 0.43. Let's now proceed with the actual regression analysis.
Next, we fill out the main dialog and subdialogs as shown below.
The first table we inspect is the Coefficients table shown below.
$$Costs' = -3263.6 + 509.3 \cdot Sex + 114.7 \cdot Age + 50.4 \cdot Alcohol\\ + 139.4 \cdot Cigarettes - 271.3 \cdot Exercise$$
where \(Costs'\) denotes predicted yearly health care costs in dollars.
Each b-coefficient indicates the average increase in costs associated with a 1-unit increase in a predictor. For example, a 1-year increase in age results in an average $114.7 increase in costs. Or a 1 hour increase in exercise per week is associated with a -$271.3 increase (that is, a $271.3 decrease) in yearly health costs.
Now, let's talk about sex : a 1-unit increase in sex results in an average $509.3 increase in costs. For understanding what this means, please note that sex is coded 0 (female) and 1 (male) in our example data. So for this variable, the only possible 1-unit increase is from female (0) to male (1). Therefore, B = $509.3 simply means that the average yearly costs for males are $509.3 higher than for females (everything else equal, that is). This hopefully clarifies how dichotomous variables can be used in multiple regression. We'll expand on this idea when we'll cover dummy variables in a later tutorial.
Now, our b-coefficients don't tell us the relative strengths of our predictors. This is because these have different scales: is a cigarette per day more or less than an alcoholic beverage per week? One way to deal with this, is to compare the standardized regression coefficients or beta coefficients, often denoted as β (the Greek letter “beta”). In statistics, β also refers to the probability of committing a type II error in hypothesis testing. This is why (1 - β ) denotes power but that's a completely different topic than regression coefficients.
Beta coefficients are obtained by standardizing all regression variables into z-scores before computing the b-coefficients. Standardizing variables applies a similar standard (or scale) to them: the resulting z-scores always have a mean of 0 and a standard deviation of 1, regardless of whether they're computed over years, cigarettes or alcoholic beverages. That's why b-coefficients computed over standardized variables (beta coefficients) are comparable within and between regression models.
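A minimal R sketch of this idea (R is not part of this SPSS tutorial, and the data below are simulated, not health-costs.sav): the slopes from a regression on z-scored variables are the beta coefficients.

```r
set.seed(3)
d <- data.frame(age = rnorm(200, 40, 10), cigarettes = rpois(200, 5))
d$costs <- 1000 + 100 * d$age + 150 * d$cigarettes + rnorm(200, sd = 500)

coef(lm(costs ~ age + cigarettes, data = d))                        # b-coefficients, original units
coef(lm(scale(costs) ~ scale(age) + scale(cigarettes), data = d))   # beta coefficients, z-score units
```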
Right, so our b-coefficients make up our multiple regression model. This tells us how to predict yearly health care costs. What we don't know, however, is precisely how well does our model predict these costs? We'll find the answer in the model summary table discussed below.
Sadly, SPSS doesn't include a confidence interval for R 2 adj . However, the p-value found in the ANOVA table applies to R and R-square (the rest of this table is pretty useless). It evaluates the null hypothesis that our entire regression model has a population R of zero. Since p < 0.05, we reject this null hypothesis for our example data.
It seems we're done for this analysis but we skipped an important step: checking the multiple regression assumptions.
Our data checks started off with some basic requirements. However, the “official” multiple linear regression assumptions are
We'll check if our example analysis meets these assumptions by doing 3 things:
The easy way to obtain these 2 regression plots, is selecting them in the dialogs (shown below) and rerunning the regression analysis.
Residual plots I - histogram.
The histogram over our standardized residuals shows some deviations from normality, but they're tiny. Most analysts would conclude that the residuals are roughly normally distributed. If you're not convinced, you could add the residuals as a new variable to the data via the SPSS regression dialogs. Next, you could run a Shapiro-Wilk test or a Kolmogorov-Smirnov test on them. However, we don't generally recommend these tests.
The residual scatterplot shown below is often used for checking a) the homoscedasticity and b) the linearity assumptions. If both assumptions hold, this scatterplot shouldn't show any systematic pattern whatsoever. That seems to be the case here.
Homoscedasticity implies that the variance of the residuals should be constant. This variance can be estimated from how far the dots in our scatterplot lie apart vertically . Therefore, the height of our scatterplot should neither increase nor decrease as we move from left to right. We don't see any such pattern.
A common check for the linearity assumption is inspecting if the dots in this scatterplot show any kind of curve. That's not the case here so linearity also seems to hold. On a personal note, however, I find this a very weak approach. An unusual (but much stronger) approach is to fit a variety of non-linear regression models for each predictor separately. Doing so requires very little effort and often reveals non-linearity. This can then be added to some linear model in order to improve its predictive accuracy. Sadly, this “low hanging fruit” is routinely overlooked because analysts usually limit themselves to the poor scatterplot approach that we just discussed.
The APA reporting guidelines propose the table shown below for reporting a standard multiple regression analysis.
I think it's utter stupidity that the APA table doesn't include the constant for our regression model. I recommend you add it anyway. Furthermore, note that some of the statistics in the APA table are labelled differently in the SPSS output. Last, the APA also recommends reporting a combined descriptive statistics and correlations table like the one we saw earlier.
Thanks for reading!
From the comments on this tutorial:
In this analysis we only deal with numerical variables; categorical variables such as region and marital status did not participate in the analysis. Why don't we create dummy variables for them, put the new variables into the analysis, and finally get results involving all relevant independent variables?
We didn't do that because the tutorial would become way too long like that.
Besides, multiple regression is often limited to quantitative variables and we don't want to discuss every imaginable sub-topic in a single tutorial.
We discussed the use of dummy variables in SPSS Dummy Variable Regression Tutorial .
Hope that helps!
https://doi.org/10.1136/ebnurs-2021-103425
A nurse educator is interested in finding out the academic and non-academic predictors of success in nursing students. Given the complexity of educational and clinical learning environments, demographic, clinical and academic factors (age, gender, previous educational training, personal stressors, learning demands, motivation, assignment workload, etc) influencing nursing students’ success, she was able to list various potential factors contributing towards success relatively easily. Nevertheless, not all of the identified factors will be plausible predictors of increased success. Therefore, she could use a powerful statistical procedure called regression analysis to identify whether the likelihood of increased success is influenced by factors such as age, stressors, learning demands, motivation and education.
Purposes of regression analysis.
Regression analysis has four primary purposes: description, estimation, prediction and control. 1 , 2 By description, regression can explain the relationship between dependent and independent variables. Estimation means that by using the observed values of independent variables, the value of the dependent variable can be estimated. 2 Regression analysis can be useful for predicting the outcomes and changes in dependent variables based on the relationships of dependent and independent variables. Finally, regression enables researchers to control for the effect of one or more independent variables while investigating the relationship of one independent variable with the dependent variable. 1
There are commonly three types of regression analyses, namely, linear, logistic and multiple regression. The differences among these types are outlined in table 1 in terms of their purpose, nature of dependent and independent variables, underlying assumptions, and nature of curve. 1 , 3 However, more detailed discussion for linear regression is presented as follows.
Comparison of linear, logistic and multiple regression
Linear regression analysis involves examining the relationship between one independent and one dependent variable. Statistically, the relationship between one independent variable (x) and a dependent variable (y) is expressed as: y = β 0 + β 1 x + ε. In this equation, β 0 is the y intercept and refers to the estimated value of y when x is equal to 0. The coefficient β 1 is the regression coefficient and denotes the estimated increase in the dependent variable for every unit increase in the independent variable. The symbol ε is a random error component and signifies the imprecision of regression, indicating that, in actual practice, the independent variables cannot perfectly predict the change in any dependent variable. 1 Multiple linear regression follows the same logic as univariate linear regression, except that (a) in multiple regression there is more than one independent variable and (b) there should be non-collinearity among the independent variables.
Linear and multiple regression analyses are affected by several factors, namely sample size, missing data and the nature of the sample. 2
A small sample size may only demonstrate connections among variables with strong relationships. Therefore, the sample size must be chosen based on the number of independent variables and the expected strength of the relationships.
Many missing values in the data set may affect the sample size. Therefore, all the missing values should be adequately dealt with before conducting regression analyses.
The subsamples within the larger sample may mask the actual effect of independent and dependent variables. Therefore, if subsamples are predefined, a regression within the sample could be used to detect true relationships. Otherwise, the analysis should be undertaken on the whole sample.
Building on her research interest mentioned in the beginning, let us consider a study by Ali and Naylor. 4 They were interested in identifying the academic and non-academic factors which predict the academic success of nursing diploma students. This purpose is consistent with one of the above-mentioned purposes of regression analysis (ie, prediction). Ali and Naylor’s chosen academic independent variables were preadmission qualification, previous academic performance and school type and the non-academic variables were age, gender, marital status and time gap. To achieve their purpose, they collected data from 628 nursing students between the age range of 15–34 years. They used both linear and multiple regression analyses to identify the predictors of student success. For analysis, they examined the relationship of academic and non-academic variables across different years of study and noted that academic factors accounted for 36.6%, 44.3% and 50.4% variability in academic success of students in year 1, year 2 and year 3, respectively. 4
Ali and Naylor presented the relationship among these variables using scatter plots, which are commonly used graphs for data display in regression analysis (see examples of various scatter plots in figure 1). 4 In a scatter plot, the clustering of the dots denotes the strength of the relationship, whereas the direction indicates the nature of the relationship among variables as positive (ie, an increase in one variable results in an increase in the other) or negative (ie, an increase in one variable results in a decrease in the other).
An Example of Scatter Plot for Regression.
Table 2 presents the results of regression analysis for academic and non-academic variables for year 4 students’ success. The significant predictors of student success are denoted with a significant p value. For every significant predictor, the beta value indicates the percentage increase in students’ academic success with a one unit increase in the variable.
Regression model for the final year students (N=343)
Regression analysis is a powerful and useful statistical procedure with many implications for nursing research. It enables researchers to describe, predict and estimate the relationships and draw plausible conclusions about the interrelated variables in relation to any studied phenomena. Regression also allows for controlling one or more variables when researchers are interested in examining the relationship among specific variables. Some of the key considerations are presented that may be useful for researchers undertaking regression analysis. While planning and conducting regression analysis, researchers should consider the type and number of dependent and independent variables as well as the nature and size of sample. Choosing a wrong type of regression analysis with small sample may result in erroneous conclusions about the studied phenomenon.
Regression analysis: the ultimate guide.
When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analysing what you find isn’t enough — you need to ensure it’s relevant and valuable.
The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.
And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between dependent variables and independent variables.
In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.
Regression analysis is a statistical method. It’s used for analysing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.
Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.
Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.
This seems to suggest that a high number of marketers and a high number of leads generated influences sales success. But do you need both factors to close those sales? By analysing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.
Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.
Regression analysis starts with variables that are categorised into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analysing.
1. Dependent variable.
This is the main variable that you want to analyse and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT) .
These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).
There are three easy ways to identify them:
Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.
You can identify independent variables with the following list of questions:
Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:
Explanatory variables
Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.
Predictor variables
Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out .
Experimental variables
These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.
Subject variables (also called fixed effects)
Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.
Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’
Carrying out regression analysis
So regression is about the relationships between dependent and independent variables. But how exactly do you do it?
Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.
Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.
On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.
This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.
Statistical analysis software can draw this line for you and precisely calculate the regression line. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.
A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.
This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).
While linear regression is useful, it does require you to make some assumptions about your data. Most fundamentally, it requires you to assume that the relationship between the variables really is linear.
As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.
Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.
For example, you will be assuming that there is a linear relationship between the dependent variable and each of the independent variables, and that the independent variables are not too strongly correlated with one another.
An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.
With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.
For example, if an organisation wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).
Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualise those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.
However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyse the data.
Logistic regression models the probability of a binary outcome based on independent variables.
So, what is a binary outcome? It’s when there are only two possible scenarios, either the event happens (1) or it doesn’t (0). e.g. yes/no outcomes, pass/fail outcomes, and so on. In other words, if the outcome can be described as being in either one of two categories.
Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
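As a minimal sketch of how such a model is fitted, here is a hypothetical R example (simulated data and made-up variable names, not from this guide) using glm() with a binomial family:

```r
set.seed(2)
matches <- data.frame(home = rbinom(200, 1, 0.5),   # 1 = playing at home
                      temp = rnorm(200, 15, 5))     # match-day temperature
matches$win <- rbinom(200, 1, plogis(-0.5 + 0.8 * matches$home + 0.05 * matches$temp))

fit <- glm(win ~ home + temp, data = matches, family = binomial)
predict(fit, newdata = data.frame(home = 1, temp = 18), type = "response")  # predicted win probability
```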
What are some common mistakes with regression analysis?
Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyse it are reliable and accurate.
Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.
When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data. For example, if you are looking at income data, which scales on a logarithmic distribution, you should take the Natural Log of Income as your variable then adjust the outcome after the model is created.
It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).
If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.
Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.
Benefits of using regression analysis
There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.
Here are just a few of those benefits:
Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.
Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.
Using a regression equation a business can identify areas for improvement when it comes to efficiency, either in terms of people, processes, or equipment.
For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.
Using the initial regression equation, they can use it to determine how many members of staff and how much equipment they need to meet orders.
Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.
This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?
There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.
Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.
A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance, ad spend is the independent variable and new-customer conversions are the dependent variable.
The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.
By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.
This is an example of a simple linear model. If we wanted to carry out a more complex regression equation, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.
By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.
Using the estimated effect of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.
Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.
The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends, and produce predictive models without requiring statistical training.
To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.
With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.
With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.
Institute for Digital Research and Education
Introduction.
This page shows how to perform a number of statistical tests using SPSS. Each section gives a brief description of the aim of the statistical test, when it is used, an example showing the SPSS commands and SPSS (often abbreviated) output with a brief interpretation of the output. You can see the page Choosing the Correct Statistical Test for a table that shows an overview of when each test is appropriate to use. In deciding which test is appropriate to use, it is important to consider the type of variables that you have (i.e., whether your variables are categorical, ordinal or interval and whether they are normally distributed), see What is the difference between categorical, ordinal and interval variables? for more information on this.
Most of the examples in this page will use a data file called hsb2, high school and beyond. This data file contains 200 observations from a sample of high school students with demographic information about the students, such as their gender ( female ), socio-economic status ( ses ) and ethnic background ( race ). It also contains a number of scores on standardized tests, including tests of reading ( read ), writing ( write ), mathematics ( math ) and social studies ( socst ). You can get the hsb data file by clicking on hsb2 .
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval variable) significantly differs from a hypothesized value. For example, using the hsb2 data file , say we wish to test whether the average writing score ( write ) differs significantly from 50. We can do this as shown below. t-test /testval = 50 /variable = write. The mean of the variable write for this particular sample of students is 52.775, which is statistically significantly different from the test value of 50. We would conclude that this group of students has a significantly higher mean on the writing test than 50.
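The page only shows SPSS syntax; as a rough Python analogue (assuming the hsb2 data have been exported to a hypothetical hsb2.csv), the same one-sample t-test could be run with scipy:

```python
# One-sample t-test of write against a hypothesized mean of 50 (hsb2.csv is assumed to exist locally).
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
t_stat, p_value = stats.ttest_1samp(hsb2["write"], popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```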
A one sample median test allows us to test whether a sample median differs significantly from a hypothesized value. We will use the same variable, write , as we did in the one sample t-test example above, but we do not need to assume that it is interval and normally distributed (we only need to assume that write is an ordinal variable). nptests /onesample test (write) wilcoxon(testvalue = 50).
A one sample binomial test allows us to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value. For example, using the hsb2 data file , say we wish to test whether the proportion of females ( female ) differs significantly from 50%, i.e., from .5. We can do this as shown below. npar tests /binomial (.5) = female. The results indicate that there is no statistically significant difference (p = .229). In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
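For comparison, a sketch of the same binomial test in Python (again assuming a local hsb2.csv with female coded 0/1; binomtest requires scipy 1.7 or later):

```python
# Exact binomial test: does the proportion of females differ from 0.5?
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
result = stats.binomtest(k=int(hsb2["female"].sum()), n=len(hsb2), p=0.5)
print(result.pvalue)
```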
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, let’s suppose that we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White folks. We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions. npar test /chisquare = race /expected = 10 10 10 70. These results show that racial composition in our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).
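A comparable chi-square goodness-of-fit test in Python, assuming race in the hypothetical hsb2.csv is coded 1-4 in the order Hispanic, Asian, African American, White:

```python
# Chi-square goodness of fit against hypothesized proportions of 10/10/10/70 percent.
import numpy as np
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
observed = hsb2["race"].value_counts().sort_index()        # observed counts per category
expected = np.array([0.10, 0.10, 0.10, 0.70]) * len(hsb2)  # expected counts under the hypothesis
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```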
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, using the hsb2 data file , say we wish to test whether the mean for write is the same for males and females. t-test groups = female(0 1) /variables = write. Because the standard deviations for the two groups are similar (10.3 and 8.1), we will use the “equal variances assumed” test. The results indicate that there is a statistically significant difference between the mean writing score for males and females (t = -3.734, p = .000). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12). See also SPSS Learning Module: An overview of statistical tests in SPSS
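A rough scipy equivalent of the independent-samples t-test (hypothetical hsb2.csv, female coded 0 = male, 1 = female):

```python
# Independent-samples t-test of writing scores for males vs. females (equal variances assumed).
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
males = hsb2.loc[hsb2["female"] == 0, "write"]
females = hsb2.loc[hsb2["female"] == 1, "write"]
t_stat, p_value = stats.ttest_ind(males, females)   # pass equal_var=False for Welch's version
print(t_stat, p_value)
```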
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We will use the same data file (the hsb2 data file ) and the same variables in this example as we did in the independent t-test example above and will not assume that write , our dependent variable, is normally distributed.
npar test /m-w = write by female(0 1). The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001). See also FAQ: Why is the Mann-Whitney significant when the medians are equal?
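And a sketch of the Wilcoxon-Mann-Whitney test in Python under the same assumptions:

```python
# Mann-Whitney U test comparing the distribution of write for males vs. females.
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
u_stat, p_value = stats.mannwhitneyu(
    hsb2.loc[hsb2["female"] == 0, "write"],
    hsb2.loc[hsb2["female"] == 1, "write"],
    alternative="two-sided",
)
print(u_stat, p_value)
```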
A chi-square test is used when you want to see if there is a relationship between two categorical variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command to obtain the test statistic and its associated p-value. Using the hsb2 data file , let’s see if there is a relationship between the type of school attended ( schtyp ) and students’ gender ( female ). Remember that the chi-square test assumes that the expected value for each cell is five or higher. This assumption is easily met in the examples below. However, if this assumption is not met in your data, please see the section on Fisher’s exact test below. crosstabs /tables = schtyp by female /statistic = chisq. These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828). Let’s look at another example, this time looking at the linear relationship between gender ( female ) and socio-economic status ( ses ). The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels. In this example, female has two levels (male and female) and ses has three levels (low, medium and high). crosstabs /tables = female by ses /statistic = chisq. Again we find that there is no statistically significant relationship between the variables (chi-square with two degrees of freedom = 4.577, p = 0.101). See also SPSS Learning Module: An Overview of Statistical Tests in SPSS
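In Python, the same kind of chi-square test of independence can be run from a cross-tabulation of the hypothetical hsb2.csv; note that scipy applies a continuity correction to 2x2 tables unless told otherwise:

```python
# Chi-square test of independence between school type and gender.
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
table = pd.crosstab(hsb2["schtyp"], hsb2["female"])
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)  # correction=False matches the uncorrected statistic
print(chi2, p, dof)
print(expected)   # check the expected-frequency-of-five assumption here
```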
The Fisher’s exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher’s exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS, unless you have the SPSS Exact Test Module, you can only perform a Fisher’s exact test on a 2×2 table, and these results are presented by default. Please see the results from the chi-square example above.
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable. For example, using the hsb2 data file , say we wish to test whether the mean of write differs between the three program types ( prog ). The command for this test would be: oneway write by prog. The mean of the dependent variable differs significantly among the levels of program type. However, we do not know if the difference is between only two of the levels or all three of the levels. (The F test for the Model is the same as the F test for prog because prog was the only variable entered into the model. If other variables had also been entered, the F test for the Model would have been different from prog .) To see the mean of write for each level of program type, means tables = write by prog. From this we can see that the students in the academic program have the highest mean writing score, while students in the vocational program have the lowest. See also SPSS Textbook Examples: Design and Analysis, Chapter 7 SPSS Textbook Examples: Applied Regression Analysis, Chapter 8 SPSS FAQ: How can I do ANOVA contrasts in SPSS? SPSS Library: Understanding and Interpreting Parameter Estimates in Regression and ANOVA
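A rough one-way ANOVA equivalent with scipy, splitting write by program type in the hypothetical hsb2.csv:

```python
# One-way ANOVA of writing score across the three program types, plus group means.
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
groups = [group["write"].to_numpy() for _, group in hsb2.groupby("prog")]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
print(hsb2.groupby("prog")["write"].mean())   # analogue of the SPSS means table
```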
The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test, since it permits two or more groups. We will use the same data file as the one way ANOVA example above (the hsb2 data file ) and the same variables as in the example above, but we will not assume that write is a normally distributed interval variable. npar tests /k-w = write by prog (1,3). If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-squared. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
A paired (samples) t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means on these two normally distributed interval variables differ from one another. For example, using the hsb2 data file we will test whether the mean of read is equal to the mean of write . t-test pairs = read with write (paired). These results indicate that the mean of read is not statistically significantly different from the mean of write (t = -0.867, p = 0.387).
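A paired t-test of read against write could be sketched in Python as:

```python
# Paired-samples t-test: are mean reading and writing scores different for the same students?
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
t_stat, p_value = stats.ttest_rel(hsb2["read"], hsb2["write"])
print(t_stat, p_value)
```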
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed. npar test /wilcoxon = write with read (paired). The results suggest that there is not a statistically significant difference between read and write . If you believe the differences between read and write were not ordinal but could merely be classified as positive and negative, then you may want to consider a sign test in lieu of sign rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal. npar test /sign = read with write (paired). We conclude that no statistically significant difference was found (p=.556).
You would perform McNemar’s test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group. Continuing with the hsb2 dataset used in several above examples, let us create two binary outcomes in our dataset: himath and hiread . These outcomes can be considered in a two-way contingency table. The null hypothesis is that the proportion of students in the himath group is the same as the proportion of students in hiread group (i.e., that the contingency table is symmetric). compute himath = (math>60). compute hiread = (read>60). execute. crosstabs /tables=himath BY hiread /statistic=mcnemar /cells=count. McNemar’s chi-square statistic suggests that there is not a statistically significant difference in the proportion of students in the himath group and the proportion of students in the hiread group.
You would perform a one-way repeated measures analysis of variance if you had one categorical independent variable and a normally distributed interval dependent variable that was repeated at least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more levels of the categorical variable. This tests whether the mean of the dependent variable differs by the categorical variable. We have an example data set called rb4wide , which is used in Kirk’s book Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is the variable that indicates the subject number. glm y1 y2 y3 y4 /wsfactor a(4). You will notice that this output gives four different p-values. The output labeled “sphericity assumed” is the p-value (0.000) that you would get if you assumed compound symmetry in the variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound). No matter which p-value you use, our results indicate that we have a statistically significant effect of a at the .05 level. See also SPSS Textbook Examples from Design and Analysis: Chapter 16 SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA SPSS Code Fragment: Repeated Measures ANOVA
If you have a binary outcome measured repeatedly for each subject and you wish to run a logistic regression that accounts for the effect of multiple measures from single subjects, you can perform a repeated measures logistic regression. In SPSS, this can be done using the GENLIN command and indicating binomial as the probability distribution and logit as the link function to be used in the model. The exercise data file contains 3 pulse measurements from each of 30 people assigned to 2 different diet regimens and 3 different exercise regimens. If we define a "high" pulse as being over 100, we can then predict the probability of a high pulse using diet regimen. GET FILE='C:\mydata\exercise.sav'. GENLIN highpulse (REFERENCE=LAST) BY diet (order = DESCENDING) /MODEL diet DISTRIBUTION=BINOMIAL LINK=LOGIT /REPEATED SUBJECT=id CORRTYPE = EXCHANGEABLE. These results indicate that diet is not statistically significant (Wald Chi-Square = 1.562, p = 0.211).
A factorial ANOVA has two or more categorical independent variables (either with or without the interactions) and a single normally distributed interval dependent variable. For example, using the hsb2 data file we will look at writing scores ( write ) as the dependent variable and gender ( female ) and socio-economic status ( ses ) as independent variables, and we will include an interaction of female by ses . Note that in SPSS, you do not need to have the interaction term(s) in your data set. Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables that will make up the interaction term(s). glm write by female ses. These results indicate that the overall model is statistically significant (F = 5.666, p = 0.00). The variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p = 0.002, respectively). However, the interaction between female and ses is not statistically significant (F = 0.133, p = 0.875). See also SPSS Textbook Examples from Design and Analysis: Chapter 10 SPSS FAQ: How can I do tests of simple main effects in SPSS? SPSS FAQ: How do I plot ANOVA cell means in SPSS? SPSS Library: An Overview of SPSS GLM
You perform a Friedman test when you have one within-subjects independent variable with two or more levels and a dependent variable that is not interval and normally distributed (but at least ordinal). We will use this test to determine if there is a difference in the reading, writing and math scores. The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e., reading, writing and math) are the same. To conduct a Friedman test, the data need to be in a long format. SPSS handles this for you, but in other statistical packages you will have to reshape the data before you can conduct this test. npar tests /friedman = read write math. Friedman’s chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant. Hence, there is no evidence that the distributions of the three types of scores are different.
Ordered logistic regression is used when the dependent variable is ordered, but not continuous. For example, using the hsb2 data file we will create an ordered variable called write3 . This variable will have the values 1, 2 and 3, indicating a low, medium or high writing score. We do not generally recommend categorizing a continuous variable in this way; we are simply creating a variable to use for this example. We will use gender ( female ), reading score ( read ) and social studies score ( socst ) as predictor variables in this model. We will use a logit link and on the print subcommand we have requested the parameter estimates, the (model) summary statistics and the test of the parallel lines assumption. if write ge 30 and write le 48 write3 = 1. if write ge 49 and write le 57 write3 = 2. if write ge 58 and write le 70 write3 = 3. execute. plum write3 with female read socst /link = logit /print = parameter summary tparallel. The results indicate that the overall model is statistically significant (p < .000), as are each of the predictor variables (p < .000). There are two thresholds for this model because there are three levels of the outcome variable. We also see that the test of the proportional odds assumption is non-significant (p = .563). One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients (only one model). If this was not the case, we would need different models (such as a generalized ordered logit model) to describe the relationship between each pair of outcome groups. See also SPSS Data Analysis Examples: Ordered logistic regression SPSS Annotated Output: Ordinal Logistic Regression
A factorial logistic regression is used when you have two or more categorical independent variables but a dichotomous dependent variable. For example, using the hsb2 data file we will use female as our dependent variable, because it is the only dichotomous variable in our data set; certainly not because it is common practice to use gender as an outcome variable. We will use type of program ( prog ) and school type ( schtyp ) as our predictor variables. Because prog is a categorical variable (it has three levels), we need to create dummy codes for it. SPSS will do this for you by making dummy codes for all variables listed after the keyword with . SPSS will also create the interaction term; simply list the two variables that will make up the interaction separated by the keyword by . logistic regression female with prog schtyp prog by schtyp /contrast(prog) = indicator(1). The results indicate that the overall model is not statistically significant (LR chi2 = 3.147, p = 0.677). Furthermore, none of the coefficients are statistically significant either. This shows that the overall effect of prog is not significant. See also Annotated output for logistic regression
A correlation is useful when you want to see the relationship between two (or more) normally distributed interval variables. For example, using the hsb2 data file we can run a correlation between two continuous variables, read and write . correlations /variables = read write. In the second example, we will run a correlation between a dichotomous variable, female , and a continuous variable, write . Although it is assumed that the variables are interval and normally distributed, we can include dummy variables when performing correlations. correlations /variables = female write. In the first example above, we see that the correlation between read and write is 0.597. By squaring the correlation and then multiplying by 100, you can determine what percentage of the variability is shared. Let’s round 0.597 to be 0.6, which when squared would be .36, multiplied by 100 would be 36%. Hence read shares about 36% of its variability with write . In the output for the second example, we can see the correlation between write and female is 0.256. Squaring this number yields .065536, meaning that female shares approximately 6.5% of its variability with write . See also Annotated output for correlation SPSS Learning Module: An Overview of Statistical Tests in SPSS SPSS FAQ: How can I analyze my data by categories? Missing Data in SPSS
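The same correlation and shared-variance calculation is straightforward in Python (hypothetical hsb2.csv):

```python
# Pearson correlation between read and write, and the shared variance it implies.
import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")
r, p_value = stats.pearsonr(hsb2["read"], hsb2["write"])
print(f"r = {r:.3f}, p = {p_value:.4f}, shared variance = {100 * r**2:.1f}%")
```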
Simple linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable. For example, using the hsb2 data file , say we wish to look at the relationship between writing scores ( write ) and reading scores ( read ); in other words, predicting write from read . regression variables = write read /dependent = write /method = enter. We see that the relationship between write and read is positive (.552) and based on the t-value (10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence, we would say there is a statistically significant positive linear relationship between reading and writing. See also Regression With SPSS: Chapter 1 – Simple and Multiple Regression Annotated output for regression SPSS Textbook Examples: Introduction to the Practice of Statistics, Chapter 10 SPSS Textbook Examples: Regression with Graphics, Chapter 2 SPSS Textbook Examples: Applied Regression Analysis, Chapter 5
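A simple-regression sketch with statsmodels, predicting write from read (same hypothetical hsb2.csv):

```python
# Simple linear regression of write on read using a formula interface.
import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")
model = smf.ols("write ~ read", data=hsb2).fit()
print(model.summary())   # reports the slope for read with its t-value and p-value
```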
A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted in ranks and then correlated. In our example, we will look for a relationship between read and write . We will not assume that both of these variables are normal and interval. nonpar corr /variables = read write /print = spearman. The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is statistically significant.
Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1). We have only one variable in the hsb2 data file that is coded 0 and 1, and that is female . We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output. The first variable listed after the logistic command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables. In our example, female will be the outcome variable, and read will be the predictor variable. As with OLS regression, the predictor variables must be either dichotomous or continuous; they cannot be categorical. logistic regression female with read. The results indicate that reading score ( read ) is not a statistically significant predictor of gender (i.e., being female), Wald = .562, p = 0.453. Likewise, the test of the overall model is not statistically significant, LR chi-squared = 0.56, p = 0.453. See also Annotated output for logistic regression SPSS Library: What kind of contrasts are these?
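In Python, a comparable logistic regression can be fit with statsmodels (again treating female as the outcome only for illustration):

```python
# Logistic regression of female (0/1) on reading score.
import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")
model = smf.logit("female ~ read", data=hsb2).fit()
print(model.summary())   # coefficient for read with its z statistic and p-value
```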
Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor variable in the equation. For example, using the hsb2 data file we will predict writing score from gender ( female ), reading, math, science and social studies ( socst ) scores. regression variable = write female read math science socst /dependent = write /method = enter. The results indicate that the overall model is statistically significant (F = 58.60, p = 0.000). Furthermore, all of the predictor variables are statistically significant except for read . See also Regression with SPSS: Chapter 1 – Simple and Multiple Regression Annotated output for regression SPSS Frequently Asked Questions SPSS Textbook Examples: Regression with Graphics, Chapter 3 SPSS Textbook Examples: Applied Regression Analysis
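The multiple-regression version only adds predictors to the formula; a sketch with statsmodels:

```python
# Multiple regression predicting writing score from gender and four test scores.
import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")
model = smf.ols("write ~ female + read + math + science + socst", data=hsb2).fit()
print(model.summary())   # overall F test plus a coefficient and p-value for each predictor
```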
Analysis of covariance is like ANOVA, except in addition to the categorical predictors you also have continuous predictors as well. For example, the one way ANOVA example used write as the dependent variable and prog as the independent variable. Let’s add read as a continuous variable to this model, as shown below. glm write with read by prog. The results indicate that even after adjusting for reading score ( read ), writing scores still significantly differ by program type ( prog ), F = 5.867, p = 0.003. See also SPSS Textbook Examples from Design and Analysis: Chapter 14 SPSS Library: An Overview of SPSS GLM SPSS Library: How do I handle interactions of continuous and categorical variables?
Multiple logistic regression is like simple logistic regression, except that there are two or more predictors. The predictors can be interval variables or dummy variables, but cannot be categorical variables. If you have categorical predictors, they should be coded into one or more dummy variables. We have only one variable in our data set that is coded 0 and 1, and that is female . We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output. The first variable listed after the logistic regression command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables (listed after the keyword with ). In our example, female will be the outcome variable, and read and write will be the predictor variables. logistic regression female with read write. These results show that both read and write are significant predictors of female . See also Annotated output for logistic regression SPSS Textbook Examples: Applied Logistic Regression, Chapter 2 SPSS Code Fragments: Graphing Results in Logistic Regression
Discriminant analysis is used when you have one or more normally distributed interval independent variables and a categorical dependent variable. It is a multivariate technique that considers the latent dimensions in the independent variables for predicting group membership in the categorical dependent variable. For example, using the hsb2 data file , say we wish to use read , write and math scores to predict the type of program a student belongs to ( prog ). discriminant groups = prog(1, 3) /variables = read write math. Clearly, the SPSS output for this procedure is quite lengthy, and it is beyond the scope of this page to explain all of it. However, the main point is that two canonical variables are identified by the analysis, the first of which seems to be more related to program type than the second. See also discriminant function analysis SPSS Library: A History of SPSS Statistical Features
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. For example, using the hsb2 data file , say we wish to examine the differences in read , write and math broken down by program type ( prog ). glm read write math by prog. The students in the different programs differ in their joint distribution of read , write and math . See also SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA GLM: MANOVA and MANCOVA SPSS Library: MANOVA and GLM
Multivariate multiple regression is used when you have two or more dependent variables that are to be predicted from two or more independent variables. In our example using the hsb2 data file , we will predict write and read from female , math , science and social studies ( socst ) scores. glm write read with female math science socst. These results show that all of the variables in the model have a statistically significant relationship with the joint distribution of write and read .
Canonical correlation is a multivariate technique used to examine the relationship between two groups of variables. For each set of variables, it creates latent variables and looks at the relationships among the latent variables. It assumes that all variables in the model are interval and normally distributed. SPSS requires that each of the two groups of variables be separated by the keyword with . There need not be an equal number of variables in the two groups (before and after the with ). manova read write with math science /discrim.
[Abridged SPSS MANOVA output: multivariate tests of significance (Pillais, Hotellings, Wilks, Roys), univariate F-tests for read and write , raw and standardized canonical coefficients, and correlations between the dependent variables, the covariates, and the canonical variables.]
The output above shows the linear combinations corresponding to the first canonical correlation. At the bottom of the output are the two canonical correlations. These results indicate that the first canonical correlation is .7728. The F-test in this output tests the hypothesis that the first canonical correlation is equal to zero. Clearly, F = 56.4706 is statistically significant. However, the second canonical correlation of .0235 is not statistically significantly different from zero (F = 0.1087, p = 0.7420).
Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the number of variables in a model or to detect relationships among variables. All variables involved in the factor analysis need to be interval and are assumed to be normally distributed. The goal of the analysis is to try to identify factors which underlie the variables. There may be fewer factors than variables, but there may not be more factors than variables. For our example using the hsb2 data file , let’s suppose that we think that there are some common factors underlying the various test scores. We will include subcommands for varimax rotation and a plot of the eigenvalues. We will use a principal components extraction and will retain two factors. (Using these options will make our results compatible with those from SAS and Stata; they are not necessarily the options that you will want to use.) factor /variables read write math science socst /criteria factors(2) /extraction pc /rotation varimax /plot eigen. Communality (which is the opposite of uniqueness) is the proportion of variance of the variable (i.e., read ) that is accounted for by all of the factors taken together, and a very low communality can indicate that a variable may not belong with any of the factors. The scree plot may be useful in determining how many factors to retain. From the component matrix table, we can see that all five of the test scores load onto the first factor, while all five tend to load not so heavily on the second factor. The purpose of rotating the factors is to get the variables to load either very high or very low on each factor. In this example, because all of the variables loaded onto factor 1 and not on factor 2, the rotation did not aid in the interpretation. Instead, it made the results even more difficult to interpret. See also SPSS FAQ: What does Cronbach’s alpha mean?
Data analysis is the practice of working with data to glean useful information, which can then be used to make informed decisions.
"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts," Sherlock Holme's proclaims in Sir Arthur Conan Doyle's A Scandal in Bohemia.
This idea lies at the root of data analysis. When we can extract meaning from data, it empowers us to make better decisions. And we’re living in a time when we have more data than ever at our fingertips.
Companies are wising up to the benefits of leveraging data. Data analysis can help a bank to personalize customer interactions, a health care system to predict future health needs, or an entertainment company to create the next big streaming hit.
The World Economic Forum Future of Jobs Report 2023 listed data analysts and scientists as one of the most in-demand jobs, alongside AI and machine learning specialists and big data specialists [ 1 ]. In this article, you'll learn more about the data analysis process, different types of data analysis, and recommended courses to help you get started in this exciting field.
Read more: How to Become a Data Analyst (with or Without a Degree)
Interested in building your knowledge of data analysis today? Consider enrolling in one of these popular courses on Coursera:
In Google's Foundations: Data, Data, Everywhere course, you'll explore key data analysis concepts, tools, and jobs.
In Duke University's Data Analysis and Visualization course, you'll learn how to identify key components for data analytics projects, explore data visualization, and find out how to create a compelling data story.
As the data available to companies continues to grow both in amount and complexity, so too does the need for an effective and efficient process by which to harness the value of that data. The data analysis process typically moves through several iterative phases. Let’s take a closer look at each.
Identify the business question you’d like to answer. What problem is the company trying to solve? What do you need to measure, and how will you measure it?
Collect the raw data sets you’ll need to help you answer the identified question. Data collection might come from internal sources, like a company’s client relationship management (CRM) software, or from secondary sources, like government records or social media application programming interfaces (APIs).
Clean the data to prepare it for analysis. This often involves purging duplicate and anomalous data, reconciling inconsistencies, standardizing data structure and format, and dealing with white spaces and other syntax errors.
Analyze the data. By manipulating the data using various data analysis techniques and tools, you can begin to find trends, correlations, outliers, and variations that tell a story. During this stage, you might use data mining to discover patterns within databases or data visualization software to help transform data into an easy-to-understand graphical format.
Interpret the results of your analysis to see how well the data answered your original question. What recommendations can you make based on the data? What are the limitations to your conclusions?
You can complete hands-on projects for your portfolio while practicing statistical analysis, data management, and programming with Meta's beginner-friendly Data Analyst Professional Certificate . Designed to prepare you for an entry-level role, this self-paced program can be completed in just 5 months.
Or, learn more about data analysis in this lecture by Kevin, Director of Data Analytics at Google, from Google's Data Analytics Professional Certificate :
Read more: What Does a Data Analyst Do? A Career Guide
Data can be used to answer questions and support decisions in many different ways. To identify the best way to analyze your data, it can help to familiarize yourself with the four types of data analysis commonly used in the field.
In this section, we’ll take a look at each of these data analysis methods, along with an example of how each might be applied in the real world.
Descriptive analysis tells us what happened. This type of analysis helps describe or summarize quantitative data by presenting statistics. For example, descriptive statistical analysis could show the distribution of sales across a group of employees and the average sales figure per employee.
Descriptive analysis answers the question, “what happened?”
If the descriptive analysis determines the “what,” diagnostic analysis determines the “why.” Let’s say a descriptive analysis shows an unusual influx of patients in a hospital. Drilling into the data further might reveal that many of these patients shared symptoms of a particular virus. This diagnostic analysis can help you determine that an infectious agent—the “why”—led to the influx of patients.
Diagnostic analysis answers the question, “why did it happen?”
So far, we’ve looked at types of analysis that examine and draw conclusions about the past. Predictive analytics uses data to form projections about the future. Using predictive analysis, you might notice that a given product has had its best sales during the months of September and October each year, leading you to predict a similar high point during the upcoming year.
Predictive analysis answers the question, “what might happen in the future?”
Prescriptive analysis takes all the insights gathered from the first three types of analysis and uses them to form recommendations for how a company should act. Using our previous example, this type of analysis might suggest a market plan to build on the success of the high sales months and harness new growth opportunities in the slower months.
Prescriptive analysis answers the question, “what should we do about it?”
This last type is where the concept of data-driven decision-making comes into play.
Read more: Advanced Analytics: Definition, Benefits, and Use Cases
Data-driven decision-making, sometimes abbreviated to DDDM, can be defined as the process of making strategic business decisions based on facts, data, and metrics instead of intuition, emotion, or observation.
This might sound obvious, but in practice, not all organizations are as data-driven as they could be. According to global management consulting firm McKinsey Global Institute, data-driven companies are better at acquiring new customers, maintaining customer loyalty, and achieving above-average profitability [ 2 ].
If you’re interested in a career in the high-growth field of data analytics, consider these top-rated courses on Coursera:
Begin building job-ready skills with the Google Data Analytics Professional Certificate . Prepare for an entry-level job as you learn from Google employees—no experience or degree required.
Practice working with data with Macquarie University's Excel Skills for Business Specialization . Learn how to use Microsoft Excel to analyze data and make data-informed business decisions.
Deepen your skill set with Google's Advanced Data Analytics Professional Certificate . In this advanced program, you'll continue exploring the concepts introduced in the beginner-level courses, plus learn Python, statistics, and Machine Learning concepts.
Where is data analytics used?
Just about any business or organization can use data analytics to help inform their decisions and boost their performance. Some of the most successful companies across a range of industries — from Amazon and Netflix to Starbucks and General Electric — integrate data into their business plans to improve their overall business performance.
Data analysis makes use of a range of analysis tools and technologies. Some of the top skills for data analysts include SQL, data visualization, statistical programming languages (like R and Python), machine learning, and spreadsheets.
Read: 7 In-Demand Data Analyst Skills to Get Hired in 2022
Data from Glassdoor indicates that the average base salary for a data analyst in the United States is $75,349 as of March 2024 [ 3 ]. How much you make will depend on factors like your qualifications, experience, and location.
Data analytics tends to be less math-intensive than data science. While you probably won’t need to master any advanced mathematics, a foundation in basic math and statistical analysis can help set you up for success.
Learn more: Data Analyst vs. Data Scientist: What’s the Difference?
World Economic Forum. " The Future of Jobs Report 2023 , https://www3.weforum.org/docs/WEF_Future_of_Jobs_2023.pdf." Accessed March 19, 2024.
McKinsey & Company. " Five facts: How customer analytics boosts corporate performance , https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance." Accessed March 19, 2024.
Glassdoor. " Data Analyst Salaries , https://www.glassdoor.com/Salaries/data-analyst-salary-SRCH_KO0,12.htm" Accessed March 19, 2024.
This quick example of research using multiple regression analysis revealed an interesting finding. The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated. The relationship means that the greater ...
The formula for a multiple linear regression is: y = b0 + b1X1 + ... + bnXn, where y is the predicted value of the dependent variable, b0 is the y-intercept (the value of y when all other parameters are set to 0), and b1 is the regression coefficient of the first independent variable X1 (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value ...
PART II: Basic and Advanced Regression Analysis. 5A.4 Multiple Regression Research. 5A.4.1 Research Problems Suggesting a Regression Approach. If the research problem is expressed in a form that either specifies or implies prediction, multiple regression analysis becomes a viable candidate for the design. Here are some examples of research
When could this happen in real life: Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated. Spatial data: Each sample corresponds to a different location in space. Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment ...
Photo by Ferdinand Stöhr on Unsplash. Multiple linear regression is one of the most fundamental statistical models due to its simplicity and interpretability of results. For prediction purposes, linear models can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio, or sparse data (Hastie et al., 2009).
These questions can in principle be answered by multiple linear regression analysis. In the multiple linear regression model, Y has a normal distribution with mean β0 + β1X1 + ... + βpXp. The model parameters β0, β1, ..., βp and σ must be estimated from data. β0 = intercept. β1, ..., βp = regression coefficients.
Second, multiple regression is an extraordinarily versatile calculation, underlying many widely used Statistics methods. A sound understanding of the multiple regression model will help you to understand these other applications. Third, multiple regression offers our first glimpse into statistical models that use more than two quantitative ...
The value of t.025 is found in a t-table, using the usual df of t for assessing statistical significance of a regression coefficient (N - the number of X's - 1), and is the value that leaves a tail of the t-curve with 2.5% of the total probability. For instance, if df = 30, then t.025 = 2.042.
Objective: To test whether the relation between income inequality and mortality found in US states is because of different levels of formal education. Design: Cross sectional, multiple regression analysis. Setting: All US states and the District of Columbia (n=51). Data sources: US census statistics and vital statistics for the years 1989 and 1990. Main outcome measure: Multiple regression ...
With multiple regression what we're doing is looking at the effect of each variable, while holding the other variable constant. Specifically, a one unit increase in computers is associated with an increase of math scores of.002 points when holding the number of students constant, and that change is highly significant.
Multiple Regression Write Up. Here is an example of how to write up the results of a standard multiple regression analysis: In order to test the research question, a multiple regression was conducted, with age, gender (0 = male, 1 = female), and perceived life stress as the predictors, with levels of physical illness as the dependent variable.
Abstract. Multiple regression is one of the most significant forms of regression and has a wide range of applications. The study of the implementation of multiple regression analysis in different ...
In this chapter, we will discuss multiple regression. In multiple regression analysis, a single outcome variable is modeled as a linear combination of as many additional variables as desired. Multiple regression is sometimes used simply to understand factors that are relevant to predicting an outcome, but it is also used in science to help ...
Multiple linear regression analyses were used in order to examine moderation effects between anxiety, stress, self-esteem and affect on depression. The analysis indicated that about 52% of the variation in the dependent variable (i.e., depression) could be explained by the main effects and the interaction effects ( R 2 = .55, adjusted R 2 = .51 ...
Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score.
3. Multiple regression analysis. The main purpose of this analysis is to know to what extent the profit size is influenced by the five independent variables, and what measures should be taken based on the results obtained using SPSS - Statistical Package for Social Sciences [C. Constantin, 2006]. The table below provides us the
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value ...
Analyze > Descriptive Statistics > Descriptives. Move the desired variables into the "Variables" window. Check the box on the lower left - "Save standardized values as variables". When you run this command, you will get the requested statistics, and new variables will be added to the spreadsheet.
Multiple Linear Regression - MLR: Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of ...
Examples of multivariate regression. Example 1. A researcher has collected data on three psychological variables, four academic variables (standardized test scores), and the type of educational program the student is in for 600 high school students. She is interested in how the set of psychological variables is related to the academic variables ...
In simple regression analysis, plotting a residual plot of the predictor variable \(\times \) residuals was effective. In multiple regression analysis, as shown in Fig. 15.5, it is effective to draw a residual plot of the predicted values of the criterion variable \(\times \) residuals. For example, it can be observed that country number 13 has ...
Multiple Regression Using SPSS Performing the Analysis With SPSS Example 1: - We want to determine whether hours spent revising, anxiety scores, and A-level entry points have effect on exam scores for participants. Dependent variable: exam score Predictors: hours spent revising, anxiety scores, and A-level entry points.
You can expect to receive from me a few assignments in which I ask you to conduct a multiple regression analysis and then present the results. I suggest that you use the examples below as your models when preparing such assignments. Table 1. Graduate Grade Point Averages Related to Criteria Used When Making Admission Decisions (N = 30). Variable.
R denotes the multiple correlation coefficient. This is simply the Pearson correlation between the actual scores and those predicted by our regression model. R-square or R 2 is simply the squared multiple correlation. It is also the proportion of variance in the dependent variable accounted for by the entire regression model.
The research question for those using multiple regression concerns how the multiple independent variables, either by themselves or together, influence changes in the depen- ... analysis. The multiple regression example used in this chapter is as basic as possible—small sam - ... The research scenario centers on the belief that an individual ...
Example. Building on her research interest mentioned in the beginning, let us consider a study by Ali and Naylor.4 They were interested in identifying the academic and non-academic factors which predict the academic success of nursing diploma students. This purpose is consistent with one of the above-mentioned purposes of regression analysis (ie, prediction).
When running regression analysis, be it a simple linear or multiple regression, it's really important to check that the assumptions your chosen method requires have been met. If your data points don't conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data.
Simple Linear Regression: 5. Perform a simple linear regression using the ordinary least squares method on your dataset. a. Fit and transform your model on the dataset. b. Produce the summary output of the fitted model. c. Produce predictions from your input data (X). d.