The average wage for African Americans in the data is 808.5, and for others the average wage is 1174. That means that African Americans (in this really specific data set) earn roughly 69% of what others earn, or 365.5 less.
Let’s say we take that fact to someone that doesn’t believe that African Americans are discriminated against. We’ll call them your “contrarian friend”; you can fill in other ideas of what you’d think about that person. What will their response be? Probably that it isn’t evidence of discrimination, because of course African Americans earn less: they’re less likely to work in white collar jobs. And people that work in white collar jobs earn more, so that’s the reason African Americans earn less. It’s not discrimination, it’s just that they work different jobs.
And on the surface, they’d be right. African Americans are more likely to work in blue collar jobs (65% to 50%), and blue collar jobs earn less (956 for blue collar jobs to 1350 for white collar jobs).
ethnicity | blue_collar |
---|---|
other | 0.5018 |
afam | 0.6512 |
blue_collar | wage |
---|---|
0 | 1350 |
1 | 956.4 |
So what we’d want to do then is compare African Americans to others that both work blue collar jobs, and African Americans to others working white collar jobs. If there is a difference in wages between two people working the same job, that’s better evidence that the pay gap is a result not of their occupational choices but their race.
We can visualize that with a two by two chart.
Let’s work across that chart to see what it tells us. A 2 by 2 chart like that is called a cross tab because it lets us tabulate figures across different characteristics of our data. Cross tabs can be a methodologically simple way (we’re just showing means/averages there) to tell a story if the data is clear.
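As a rough sketch of how you might build that cross tab in R, assuming the data frame is called `wages` and the columns are named `wage`, `ethnicity`, and `blue_collar` (those names are my assumptions, not necessarily what the data set uses):

```r
# Mean wage for every combination of ethnicity and occupation type
aggregate(wage ~ ethnicity + blue_collar, data = wages, FUN = mean)
```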
So what do we learn? Looking at the top row, white collar workers that are labeled other for ethnicity earn on average $1373. And white collar workers that are African American earn $918. Which means that for white collar workers, African Americans earn $455 less. For blue collar workers, other races earn $977, while African Americans earn $749. That’s a gap of $228. So the size of the gap is different depending on what a person’s job is, but African Americans earn less regardless of their job. So it isn’t just that African Americans are less likely to work white collar jobs that drives their lower wages. Even those in white collar jobs earn less. In fact, African Americans in white collar jobs earn less on average than other races working blue collar jobs!
This is what it means to hold something constant. In that table above we’re holding occupation constant, and comparing people based on their race to people of another race that work the same job. So differences in those jobs aren’t influencing our results now, we’ve set that effect aside for the moment.
And we can do that automatically with regression, like we did when we looked at the effect of computers on math scores, while holding the impact of school enrollment constant.
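In R, holding occupation constant just means adding it to the model formula. A minimal sketch, again assuming the data frame and column names used above:

```r
# Raw gap: wage regressed on ethnicity alone
lm(wage ~ ethnicity, data = wages)

# The gap holding occupation constant
m_occ <- lm(wage ~ ethnicity + blue_collar, data = wages)
summary(m_occ)
```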
Based on those regression results, African Americans earn $309 less than other races when holding occupation constant, and that effect is highly significant. And blue collar workers earn $380 less than white collar workers when holding race constant, and that effect is significant too.
So have we proven discrimination in wages? Probably not yet for the contrarian friend. Without pause they’ll likely say that education is also important for wages, and African Americans are less likely to go to college. And in the data they’d be correct. On average African Americans completed 11.65 years of education, and other races completed 12.94.
ethnicity | education |
---|---|
other | 12.94 |
afam | 11.65 |
So let’s add that to our regression too.
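As a sketch, assuming the years-of-education column is called `education`:

```r
# Add education as a second control
m_educ <- lm(wage ~ ethnicity + blue_collar + education, data = wages)
summary(m_educ)
```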
Now with the ethnicity variable we’re comparing people of different ethnicities that have the same occupation and education. And what do we find? Even holding both of those constant, we would expect an African American worker to earn $262 less, and that is highly significant.
What your contrarian friend is doing is proposing alternative variables and hypotheses that explain the gap in earnings for African Americans. And while those other things do make a difference, they don’t fully explain why African Americans earn less than others. We have shrunk the gap somewhat. Originally the gap was 365.5, which fell to 309 when we held occupation constant and now 262 with the inclusion of education. So those alternative explanations do explain a portion of why African Americans earned less, it was because they had lower-status jobs and less education (setting aside the fact that their lower-status jobs and less education may be the result of discrimination).
So what else do we want to include to try and explain that difference in wages? We can insert all of the variables in the data set to see if there is still a gap in wages between African Americans and others.
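A sketch of that kitchen-sink model, with placeholder column names (experience, weeks worked, industry, region, married, gender, union) that you’d swap for whatever the data set actually calls them:

```r
# Control for everything available in the data
m_full <- lm(wage ~ ethnicity + blue_collar + education + experience +
               weeks + industry + region + married + gender + union,
             data = wages)
summary(m_full)
```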
Controlling for occupation, education, experience, weeks worked, the industry, the region of employment, whether they are married, their gender, and their union status, does ethnicity make a difference in earnings? Yes, if you found two workers that had the same values for all of those variables except that they were of different races, the African American would still likely earn less.
In our regression African Americans earn $167 less when holding occupation, education, experience, weeks worked, the industry, region, marriage, gender, and their union status constant, and that effect is still statistically significant.
The contrarian friend may still have another alternative hypothesis to attempt to explain away that result, but unfortunately that’s all the data will let us test.
What we’re attempting to do is minimize what is called missing variable bias. If there is a plausible story that could explain our result, whether we’re predicting math test scores or wages or anything else, and we fail to account for that explanation, our model may be misleading. It was misleading to say that computers don’t increase math test scores when we didn’t control for the effect of larger school sizes.
What missing variables do we not have that may explain the difference in earnings between African Americans and others? We don’t know who is a manager at work or anything about job performance, and both of those should help explain why people earn more. So we haven’t removed our missing variable bias, and the evidence we can provide is limited by that. But based on the evidence we can generate, we find evidence of racial discrimination in wages.
And I should again emphasize, even if something else did explain the gap in earnings between African Americans and others it wouldn’t prove there wasn’t discrimination in society. If differences in occupation did explain the racial gap in wages, that wouldn’t prove that discrimination didn’t push African Americans towards lower paying jobs.
But the work we’ve done above is similar to what a law firm would do if bringing a lawsuit against a large employer for wage discrimination. It’s hard to prove discrimination in individual cases. The employer will always just argue that John is a bad employee, and that’s why they earn less than their coworkers. Wage discrimination suits are typically brought as class action suits, where a large group of employees sues based on evidence that even when accounting for differences in specific job, and job performance, and experience, and other things there is still a gap in wages.
I should add a note about interpretation here. It’s the researcher that has to identify what the different coefficients mean in the real world. We can talk about discrimination because of differences in earnings for African Americans and others, but we wouldn’t say that blue collar workers are discriminated against because they earn less than white collar workers. It’s unlikely that someone would say that people with more experience earning more is the result of discrimination. These are interpretations that we layer on to the analysis based on our expectations and understanding of the research question.
Regression can be used to make predictions and learn more about the world in all sorts of contexts. Let’s work through another example, with a little more focus on the interpretation.
We’ll use a data set called Affairs, which unsurprisingly has data about affairs. Or more specifically, about people, and whether or not they have had an affair.
In the data set there are 10 variables.
So we can throw all of those variables into a regression and see which ones have the largest impact on the likelihood someone had an affair. But before that we should pause to make predictions. We shouldn’t include a variable just for laughs - we should have a reason for including it. We should be able to make a prediction for whether it should increase or decrease the dependent variable.
So what effect do you think these independent variables will have on the chances of someone having had an affair?
Those arguments may be right or wrong. And they certainly won’t be right in every case in the data - there will be counterexamples. What I’ve tried to do is lay out predictions, or hypotheses, for what I expect the model to show us. Let’s test them all and see what predicts whether someone had an affair.
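One way to run that test in R, as a sketch: this assumes the data frame is named `Affairs`, its columns carry the names below, and that we first build a 0/1 indicator for whether the person reported any affair (all of those details are assumptions on my part):

```r
# 0/1 indicator for whether the respondent reported at least one affair
Affairs$had_affair <- as.numeric(Affairs$affairs > 0)

# Linear probability model with every predictor included
affair_model <- lm(had_affair ~ gender + age + yearsmarried + children +
                     religiousness + education + occupation + rating,
                   data = Affairs)
summary(affair_model)
```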
What do you see as the strongest predictors of whether someone had an affair? Let’s start by identifying what was highly statistically significant. Religiousness and rating both had p-values below .001, so we can be very confident that in the population people who are more religious and who report having happier marriages are both less likely to have affairs. Let’s interpret that more formally.
For each one unit increase in religiousness, an individual’s chances of having an affair decrease by .05, holding their gender, age, years married, children, education, occupation and rating constant, and that change is significant.
That’s a long list of things we’re holding constant! When you get past 2 or 3 control variables, or when you’re describing different variables from the same model you can use “holding all else constant” in place of the list.
For each one unit increase in the happiness rating of a marriage, an individual’s chances of having an affair decrease by .09, holding all else constant, and that change is significant.
What else that we included in the model is useful for predicting whether someone had an affair?
Age and years married both reach statistical significance. As individuals get older, their chances of having an affair decrease, as I predicted.
However, as their marriages get longer the chances of having had an affair increase, not decrease as I thought. Interesting! Does that mean I should go back and change my prediction? No. What it likely means is that some of my assumptions were wrong, so I should update them and discuss why I was wrong (in the conclusion if this was a paper). If we only used regression to find things that we already know, we wouldn’t learn anything new. It’s still good that I made a prediction though, because that highlights that the result is a little weird (to my eyes) or may be more surprising to readers. Imagine if you found that a new jobs program actually lowered participants’ incomes; that would be a really important outcome of your research and just as valuable as if you’d found that incomes increased.
A surprising finding could also be evidence that there’s something wrong in the data. Did we enter years of marriage correctly, or did we possibly reverse it so that longer marriages are actually coded as lower numbers? That’d be odd in this case, but it’s always worth thinking that possibility through. If I got data that showed college graduates earned less than those without a high school degree I’d be very skeptical of the data, because that would go against everything we know. It might just be an odd, fluky one-time finding, or it could be evidence something is wrong in the data.
Okay, what about everything else? All the other variables are insignificant. Should we remove them from the analysis, since they don’t have a significant effect on our dependent variable? It depends. Insignificant variables can be worth including in most cases in order to show that they don’t have an effect on the outcome. It’s worth knowing that gender and children don’t have an effect on affairs in the population. We had a reason to think they would, and it turns out they don’t really have much of an influence on whether someone has sex outside their marriage. That’s good to know.
I didn’t have a prediction for education or occupation though, and the fact that they are insignificant means they aren’t really worth including. I’m not testing any interesting ideas about what affects affairs with those variables; they’re just being included because they’re in the data. That’s not a good reason for them to be there - we want to be testing something with each variable we include.
In truth, we haven’t done a lot of new work on code in this chapter. We’ve focused more on the big idea of what it means to go from bivariate regression to multivariate regression. So we won’t do a lot of practice, because the basic structure we learned in the last chapter drives most of what we’ll do.
We’ll read in some new data on Massachusetts schools and their test scores. It’s similar to the California Schools data, but from Massachusetts for variety.
We’ll focus on 4 of those variables, and try to figure out what predicts how schools do on tests in 8th grade (score8).
Let’s start by practicing writing a regression to look at the impact of spending (exptot) on test scores.
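A sketch of that first model, assuming the data frame is called `mass` (use whatever name you gave the data when reading it in):

```r
# Bivariate regression: total spending and 8th grade test scores
m_spend <- lm(score8 ~ exptot, data = mass)
summary(m_spend)
```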
That should look very similar to the last chapter. And we can interpret it the same way.
For each one unit increase in spending, we observe a .004 increase in test scores for 8th graders, and that change is significant.
Let’s add one more variable to the regression, and now include english along with exptot. To include an additional variable we just place a + sign between the two variables, as shown below.
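Continuing the sketch with the same assumed data frame name:

```r
# Add the share of non-native English speakers as a second predictor
m_spend_eng <- lm(score8 ~ exptot + english, data = mass)
summary(m_spend_eng)
```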
Each one unit increase in spending is associated with a .007 increase in test scores for 8th graders, holding the percentage of non-native English speakers constant, and that change is significant.
Each one unit increase in the percentage of students that don’t speak English as natives is associated with a 4.1 decrease in test scores for 8th graders, holding spending constant, and that change is significant.
And one more, let’s add one more variable: income.
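And the three-predictor version of the sketch:

```r
# Add parental income as a third predictor
m_three <- lm(score8 ~ exptot + english + income, data = mass)
summary(m_three)
```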
Interesting, spending actually lost its significance in that final regression and changed direction.
Each one unit increase in spending is associated with a .002 decrease in test scores for 8th graders when holding the percentage of non-native English speakers and parental income constant, but that change is insignificant.
Each one unit increase in the percentage of students that don’t speak English as natives is associated with a 2.2 decrease in test scores for 8th graders when holding spending and parental income constant, and that change is significant.
Each one unit increase in parental income is associated with a 2.8 increase in test scores for 8th graders when holding spending and the percentage of non-native English speakers constant, and that change is significant.
The following video demonstrates the coding steps done above.
Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.
To get started, let's read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. These data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:
We'll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.
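Since the original read-in code isn't reproduced here, the sketch below assumes the 17 observations live in a local file named ami_data.dat with columns in the order TOT, AMI, GEN, AMT, PR, DIAP, QRS (the file name and column order are assumptions; adjust them to your copy of the data):

```r
# Read the amitriptyline overdose data and name the columns
ami_data <- read.table("ami_data.dat")
names(ami_data) <- c("TOT", "AMI", "GEN", "AMT", "PR", "DIAP", "QRS")
```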
Before going further you may wish to explore the data using the summary() and pairs() functions.
Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and "binds" them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition but rather inclusion. Taken together the formula cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS translates to "model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS." To fit this model we use the workhorse lm() function and save it to an object we name "mlm1". Finally we view the results with summary() .
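Putting that together (the object name ami_data comes from the read-in sketch above):

```r
# Model both responses as a function of all five predictors
mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)
```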
Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if we modeled each separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They're identical.
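The check looks something like this:

```r
# Fit each response separately; the coefficient estimates match the multivariate fit
summary(lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
summary(lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
```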
The same diagnostics we check for models with one predictor should be checked for these as well. For a review of some basic but essential diagnostics see our post Understanding Diagnostic Plots for Linear Regression Analysis .
We can use R's extractor functions with our mlm1 object, except we'll get double the output. For example, instead of one set of residuals we get two; instead of one set of fitted values, two; instead of one set of coefficients, two; and instead of one residual standard error, two:
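In code, those four extractor calls look like this; each returns one column or entry per response:

```r
resid(mlm1)    # two columns of residuals
fitted(mlm1)   # two columns of fitted values
coef(mlm1)     # two columns of coefficients
sigma(mlm1)    # two residual standard errors
```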
Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don't reproduce the output here because of the size, but we encourage you to view it for yourself:
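That matrix comes from the usual extractor:

```r
# Variance-covariance matrix of the coefficients from both models
vcov(mlm1)
```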
The main takeaway is that the coefficients from both models covary . That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it's not enough to eyeball the results from the two separate regressions. We should formally test for their inclusion. And that test involves the covariances between the coefficients in both models.
Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term "multivariate" here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.
The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:
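With the model object from above, the call is just:

```r
library(car)
Anova(mlm1)
```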
The results are titled "Type II MANOVA Tests". The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. "Type II" refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.
Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function, (notice the little "a"; this is different from the Anova() function in the car package). For example, below we create a new model using the update() function that only includes GEN and AMT. The expression . ~ . - PR - DIAP - QRS says "keep the same responses and predictors except PR, DIAP and QRS."
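In code, that two-step comparison is:

```r
# Smaller model without PR, DIAP, and QRS, compared against the full model
mlm2 <- update(mlm1, . ~ . - PR - DIAP - QRS)
anova(mlm1, mlm2)
```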
The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is "Pillai", which is one of the four common multivariate test statistics.
The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.
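That call looks like this (saving the result as lh.out, since we use its components below):

```r
lh.out <- linearHypothesis(mlm1, c("PR = 0", "DIAP = 0", "QRS = 0"))
lh.out
```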
The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).
Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is $$ \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} $$
In R we can calculate that as follows:
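Using the matrices stored in lh.out:

```r
# Wilks: det(E) / det(E + H)
E <- lh.out$SSPE
H <- lh.out$SSPH
det(E) / det(E + H)
```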
Likewise the formula for Pillai is $$ tr[\bf{H}(\bf{H} + \bf{E})^{-1}] $$ tr means trace. That's the sum of the diagonal elements of a matrix. In R we can calculate as follows:
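Using the same H and E objects:

```r
# Pillai: trace of H (H + E)^(-1)
sum(diag(H %*% solve(H + E)))
```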
The formula for Hotelling-Lawley is $$ tr[\bf{H}\bf{E}^{-1}] $$ In R:
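With the same objects:

```r
# Hotelling-Lawley: trace of H E^(-1)
sum(diag(H %*% solve(E)))
```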
And finally the Roy statistics is the largest eigenvalue of \(\bf{H}\bf{E}^{-1}\). In R code:
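Again with H and E from above:

```r
# Roy: largest eigenvalue of H E^(-1)
max(eigen(H %*% solve(E))$values)
```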
Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary:
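Which is simply:

```r
summary(mlm2)
```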
Now let's say we wanted to use this model to estimate mean TOT and AMI values for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need to put our new data into a data frame with column names that match our original data.
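A sketch (the data-frame name new_dat is just my choice):

```r
new_dat <- data.frame(GEN = 1, AMT = 1200)
predict(mlm2, newdata = new_dat)
```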
This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don't calculate an interval but rather an ellipse to capture the uncertainty in two dimensions.
Unfortunately at the time of this writing there doesn't appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However, we have written one below you can use called confidenceEllipse() . The details of the function go beyond a "getting started" blog post but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict. Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics.
Here's a demonstration of the function.
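Assuming you have first run the confidenceEllipse() function definition from the original post (not reproduced here), the demonstration call would look something like this:

```r
# Uncertainty ellipse for the two predicted responses at GEN = 1, AMT = 1200
confidenceEllipse(mlm2, newdata = data.frame(GEN = 1, AMT = 1200), level = 0.95)
```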
The dot in the center is our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We're 95% confident the true mean values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa.
Clay Ford, Statistical Research Consultant, University of Virginia Library. October 27, 2017. Updated May 26, 2023 and February 20, 2024 (changed function name).
Data analysis using multiple regression is a fairly common statistical tool. Many graduate students find it too complicated to understand, but it is not that difficult to do, especially now that computers are everyday household items. You can quickly analyze more than just two sets of variables in your research using multiple regression analysis.
Multiple regression is often confused with multivariate regression. Multivariate regression, while also using several variables, deals with more than one dependent variable . Karen Grace-Martin clearly explains the difference in her post on the difference between the Multiple Regression Model and Multivariate Regression Model .
Statistical software applications used in computing multiple regression analysis.
Using multiple regression analysis requires dedicated statistical software like the popular Statistical Package for the Social Sciences (SPSS), Statistica, Microstat, and open-source statistical software applications like SOFA Statistics and JASP, among other sophisticated statistical packages.
However, a standard spreadsheet application like Microsoft Excel can also help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you cannot do this without first enabling the statistical analysis tools that ship with MS Excel.
The study pertains to identifying the factors predicting a current problem among high school students: the long hours they spend online for a variety of reasons. The purpose is to address many parents’ concerns about the difficulty of weaning their children away from the lures of online gaming, social networking, and other engaging virtual activities.
Given that there is a need to use a computer to analyze multiple variable data, a principal who is nearing retirement was “forced” to buy a laptop, as she had none. Anyhow, she is very much open-minded and performed the class activities that require data analysis with much enthusiasm.
“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”
Although many studies have identified factors that influence the use of the internet, it is standard practice to include the respondents’ profile among the set of predictor or independent variables. Hence, the standard variables age and gender are included in the multiple regression analysis.
The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated.
The relationship means that the greater the number of hours spent by the mother with her child to establish a closer emotional bond, the fewer hours spent by her child using the internet. The number of hours spent by the children online relates significantly to the mother’s number of hours interacting with their children.
But establishing a close bond between mother and child is a good start. Undertaking more investigations along this research concern will help strengthen the findings of this study.
Thus, this example of research using multiple regression analysis streamlines solutions and focuses on those influential factors that must be given attention.
Introduction.
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).
For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.
Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.
Assumptions.
When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions:
You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.
In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a multiple regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.
A health researcher wants to be able to predict "VO2 max", an indicator of fitness and health. Normally, to perform this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off those individuals who are not very active/fit and those individuals who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO2 max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO2 max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower workload cycling test. The researcher's goal is to be able to predict VO2 max based on these four attributes: age, weight, heart rate and gender.
In SPSS Statistics, we created six variables: (1) VO2 max, which is the maximal aerobic capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight (technically, it is their 'mass'); (4) heart_rate, which is the participant's heart rate; (5) gender, which is the participant's gender; and (6) caseno, which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics.
The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).
Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28 , as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version , SPSS Statistics introduced a new look to their interface called " SPSS Light ", replacing the previous look for versions 26 and earlier versions , which was called " SPSS Standard ". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical .
Published with written permission from SPSS Statistics, IBM Corporation.
Note: Don't worry that you're selecting A nalyze > R egression > L inear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression . You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.
SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.
However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:
The first table of interest is the Model Summary table. This table provides the R, R², adjusted R², and the standard error of the estimate, which can be used to determine how well a regression model fits the data:
The "R" column represents the value of R, the multiple correlation coefficient. R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO2 max. A value of 0.760, in this example, indicates a good level of prediction. The "R Square" column represents the R² value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO2 max. However, you also need to be able to interpret "Adjusted R Square" (adj. R²) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.
The F -ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F (4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit of the data).
The general form of the equation to predict VO2 max from age, weight, heart_rate, and gender is:
predicted VO2 max = 87.83 – (0.165 × age) – (0.385 × weight) – (0.118 × heart_rate) + (13.208 × gender)
This is obtained from the Coefficients table, as shown below:
Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B1, for age is equal to -0.165 (see Coefficients table). This means that for each one year increase in age, there is a decrease in VO2 max of 0.165 ml/min/kg.
You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t-value and corresponding p-value are located in the "t" and "Sig." columns, respectively, as highlighted below:
You can see from the "Sig." column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B0, is tested for statistical significance, this is rarely an important or interesting finding.
You could write up the results as follows:
A multiple regression was run to predict VO2 max from gender, age, weight and heart rate. These variables statistically significantly predicted VO2 max, F(4, 95) = 32.393, p < .0005, R² = .577. All four variables added statistically significantly to the prediction, p < .05.
If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.
6.4 - The Hypothesis Tests for the Slopes
At the beginning of this lesson, we translated three different research questions pertaining to heart attacks in rabbits ( Cool Hearts dataset ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:
Is the regression model containing at least one predictor useful in predicting the size of the infarct?
Is the size of the infarct significantly (linearly) related to the area of the region at risk?
(Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?
Let's test each of the hypotheses now using the general linear F -statistic:
\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)
To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — \(df_{R}\) and \(df_{F}\), respectively — is the number of observations, n , minus the number of parameters, p , in the model. That is, in general, the number of error degrees of freedom is n - p . We use statistical software, such as Minitab's F -distribution probability calculator, to determine the P -value for each test.
To answer the first research question, "Is the regression model containing at least one predictor useful in predicting the size of the infarct?", we test the hypotheses:
\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.
The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:
\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)
The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are 4 parameters in the full model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4\).
The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:
\(y_i=\beta_0+\epsilon_i\)
The reduced model suggests that none of the variation in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 1 \).
Upon plugging in the above quantities, the general linear F -statistic:
\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)
becomes the usual " overall F -test ":
\(F^*=\dfrac{SSR}{3} \div \dfrac{SSE}{n-4}=\dfrac{MSR}{MSE}\)
That is, to test \(H_{0}\) : \(\beta_{1} = \beta_{2} = \beta_{3} = 0 \), we just use the overall F -test and P -value reported in the analysis of variance table:
Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 3 | 0.95927 | 0.31976 | 16.43 | 0.000 |
Area | 1 | 0.63742 | 0.63742 | 32.75 | 0.000 |
X2 | 1 | 0.29733 | 0.29733 | 15.28 | 0.001 |
X3 | 1 | 0.01981 | 0.01981 | 1.02 | 0.322 |
Error | 28 | 0.54491 | 0.01946 | | |
Total | 31 | 1.50418 | | | |
Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3
There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.
In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.
Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:
\(H_0 \colon \beta_1 = 0\) against the alternative \(H_A \colon \beta_1 \ne 0\).
Again, the full model is the model containing all of the possible predictors:
The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are \(x_{1}\), \(x_{2}\), and \(x_{3}\), we can denote the error sum of squares as SSE (\(x_{1}\), \(x_{2}\), \(x_{3}\)). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4 \).
Because the null hypothesis sets the first slope parameter, \(\beta_{1}\), equal to 0, the reduced model is:
\(y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)
Because the two predictors in the model are \(x_{2}\) and \(x_{3}\), we denote the error sum of squares as SSE (\(x_{2}\), \(x_{3}\)). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 3\).
The general linear F -statistic:
simplifies to:
\(F^*=\dfrac{SSR(x_1|x_2, x_3)}{1}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}=\dfrac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\)
Getting the numbers from the Minitab output above, we determine that the value of the F -statistic is:
\(F^* = \dfrac{SSR(x_1 \vert x_2, x_3)}{1} \div \dfrac{SSE(x_1, x_2, x_3)}{28} = \dfrac{0.63742}{0.01946}=32.7554\)
The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, Minitab tells us that the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:
x | P ( X ≤x ) |
---|---|
32.7554 | 1.00000 |
Therefore, the probability that we would get an F -statistic larger than 32.7554 is close to 0. That is, the P -value is < 0.001. There is sufficient evidence ( F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk after the other predictors x2 and x3 have been taken into account.
But wait a second! Have you been wondering why we couldn't just use the slope's t -statistic to test that the slope parameter, \(\beta_{1}\), is 0? We can! Notice that the P -value ( P < 0.001) for the t -test ( t * = 5.72):
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -0.135 | 0.104 | -1.29 | 0.206 | |
Area | 0.613 | 0.107 | 5.72 | 0.000 | 1.14 |
X2 | -0.2435 | 0.0623 | -3.91 | 0.001 | 1.44 |
X3 | -0.0657 | 0.0651 | -1.01 | 0.322 | 1.57 |
is the same as the P -value we obtained for the F -test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t -statistic and an F -statistic that has one numerator degree of freedom:
\(t_{(n-p)}^{2}=F_{(1, n-p)}\)
For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:
\(t^{*2}=5.72^2=32.72=F^*\)
So what have we learned in all of this discussion about the equivalence of the F -test and the t -test? In short:
Compare the output obtained when \(x_{1}\) = Area is entered into the model last :
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -0.135 | 0.104 | -1.29 | 0.206 | |
X2 | -0.2435 | 0.0623 | -3.91 | 0.001 | 1.44 |
X3 | -0.0657 | 0.0651 | -1.01 | 0.322 | 1.57 |
Area | 0.613 | 0.107 | 5.72 | 0.000 | 1.14 |
Inf = - 0.135 - 0.2435 X2 - 0.0657 X3 + 0.613 Area
to the output obtained when \(x_{1}\) = Area is entered into the model first (the output shown earlier):
The t -statistic and P -value are the same regardless of the order in which \(x_{1}\) = Area is entered into the model. That's because — by its equivalence to the F -test — the t -test for one slope parameter adjusts for all of the other predictors included in the model.
Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:
\(H_0 \colon \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of \(\beta_2\) and \(\beta_3\) is not 0.
Because the null hypothesis sets the second and third slope parameters, \(\beta_{2}\) and \(\beta_{3}\), equal to 0, the reduced model is:
\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)
The ANOVA table for the reduced model is:
Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 0.6249 | 0.62492 | 21.32 | 0.000 |
Area | 1 | 0.6249 | 0.62492 | 21.32 | 0.000 |
Error | 30 | 0.8793 | 0.02931 | | |
Total | 31 | 1.5042 | | | |
Because the only predictor in the model is \(x_{1}\), we denote the error sum of squares as SSE (\(x_{1}\)) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 2 = 32 – 2 = 30\).
\begin{align} F^*&=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div\dfrac{SSE(F)}{df_F}\\&=\dfrac{0.8793-0.54491}{30-28} \div\dfrac{0.54491}{28}\\&= \dfrac{0.33439}{2} \div 0.01946\\&=8.59.\end{align}
Alternatively, we can calculate the F-statistic using a partial F-test :
\begin{align}F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{MSR(x_2, x_3|x_1)}{MSE(x_1,x_2, x_3)}.\end{align}
To conduct the test, we regress y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3 }\)— in order (and with "Sequential sums of squares" selected under "Options"):
Source | DF | Seq SS | Seq MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 3 | 0.95927 | 0.31976 | 16.43 | 0.000 |
Area | 1 | 0.62492 | 0.62492 | 32.11 | 0.000 |
X2 | 1 | 0.31453 | 0.31453 | 16.16 | 0.001 |
X3 | 1 | 0.01981 | 0.01981 | 1.02 | 0.322 |
Error | 28 | 0.54491 | 0.01946 | | |
Total | 31 | 1.50418 | | | |
Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3
yielding SSR (\(x_{2}\) | \(x_{1}\)) = 0.31453, SSR (\(x_{3}\) | \(x_{1}\), \(x_{2}\)) = 0.01981, and MSE = 0.54491/28 = 0.01946. Therefore, the value of the partial F -statistic is:
\begin{align} F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{0.31453+0.01981}{2}\div\dfrac{0.54491}{28}\\&= \dfrac{0.33434}{2} \div 0.01946\\&=8.59,\end{align}
which is identical (within round-off error) to the general F-statistic above. The P -value is the probability — if the null hypothesis were true — that we would observe a partial F -statistic more extreme than 8.59. The following Minitab output:
x | P ( X ≤ x ) |
---|---|
8.59 | 0.998767 |
tells us that the probability of observing such an F -statistic that is smaller than 8.59 is 0.9988. Therefore, the probability of observing such an F -statistic that is larger than 8.59 is 1 - 0.9988 = 0.0012. The P -value is very small. There is sufficient evidence ( F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs — after taking into account the size of the region at risk.
For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:
We have learned how to perform each of the above three hypothesis tests. Along the way, we also took two detours — one to learn about the " general linear F-test " and one to learn about " sequential sums of squares. " As you now know, knowledge about both is necessary for performing the three hypothesis tests.
The F -statistic and associated p -value in the ANOVA table is used for testing whether all of the slope parameters are 0. In most applications, this p -value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F -statistic is (0.95927/(4–1)) / (0.54491/(32–4)) = 16.43 with p -value 0.000.
To test whether a subset — more than one, but not all — of the slope parameters are 0, there are two equivalent ways to calculate the F-statistic:
For example, for the rabbit heart attacks study, the general linear F-statistic is ((0.8793 – 0.54491) / (30 – 28)) / (0.54491 / 28) = 8.59 with p -value 0.0012. Alternatively, the partial F -statistic for testing the slope parameters for predictors \(x_{2}\) and \(x_{3}\) using sequential sums of squares is ((0.31453 + 0.01981) / 2) / (0.54491 / 28) = 8.59.
To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case, the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and \(5.72^{2} = 32.72\).
Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters is 0. For example, for the rabbit heart attacks study, we could have simply used the individual t-tests for the \(x_{2}\) and \(x_{3}\) slope parameters to decide whether each of \(\beta_{2}\) and \(\beta_{3}\) is 0.
The problem with this approach is we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by rejecting a null hypothesis that is true (a Type I error) or by failing to reject a null hypothesis that is false (a Type II error).
Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.
The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for whether all of the slope parameters are 0, whether a single slope parameter is 0, and whether a subset (more than one, but not all) of the slope parameters is 0.
(Note the correct specification of the alternative hypotheses for the last two situations.)
A group of researchers was interested in studying the effects of three different growth regulators ( treat , denoted 1, 2, and 3) on the yield of sugar beets (y = yield , in pounds). They planned to plant the beets in 30 different plots and then randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_{1}\) = nit , in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The Sugar Beets dataset contains the data from the researcher's experiment.
The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.
Because the qualitative variable treat distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_{2}\) and \(x_{3}\), say, to fit a linear regression model to these data. The new indicator variables should be defined as follows:
treat | \(x_2\) | \(x_3\) |
---|---|---|
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 0 |
Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet.
Minitab creates an indicator variable for each treatment group but we can only use two, for treatment groups 1 and 2 in this case (treatment group 3 is the reference level in this case).
Then, if we assume the trend in the data can be summarized by this regression model:
\(y_{i} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon_{i}\)
where \(x_{1}\) = nit and \(x_{2}\) and \(x_{3}\) are defined as above, what is the mean response function for plots receiving treatment 3? for plots receiving treatment 1? for plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_{2}\) quantify? And, what does the parameter \(\beta_{3}\) quantify?
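Working through the first part of the question, substituting the indicator codings into the model gives one mean response function per treatment group:

\(\mu_{Y} = \beta_{0} + \beta_{1}x_{1}\) for plots receiving treatment 3 (\(x_{2} = 0, x_{3} = 0\))

\(\mu_{Y} = (\beta_{0} + \beta_{2}) + \beta_{1}x_{1}\) for plots receiving treatment 1 (\(x_{2} = 1, x_{3} = 0\))

\(\mu_{Y} = (\beta_{0} + \beta_{3}) + \beta_{1}x_{1}\) for plots receiving treatment 2 (\(x_{2} = 0, x_{3} = 1\))

All three functions share the slope \(\beta_{1}\), which is why the three lines are parallel, and \(\beta_{2}\) and \(\beta_{3}\) are the vertical distances of the treatment 1 and treatment 2 lines from the reference (treatment 3) line.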
The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_{2}\) - 2.35 \(x_{3}\), which means that the fitted equations for each treatment group are:

Treatment 1 (\(x_{2} = 1, x_{3} = 0\)): Yield = 82.56 + 1.3088 Nit

Treatment 2 (\(x_{2} = 0, x_{3} = 1\)): Yield = 82.64 + 1.3088 Nit

Treatment 3 (\(x_{2} = 0, x_{3} = 0\)): Yield = 84.99 + 1.3088 Nit
The three estimated regression lines are parallel since they have the same slope, 1.3088.
The regression parameter for \(x_{2}\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for reference treatment 3.
The regression parameter for \(x_{3}\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for reference treatment 3.
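Continuing the Python sketch from above (the beets DataFrame and its column names remain assumptions), the same parallel-lines model can be fit with the statsmodels formula API:

```python
# Fit the parallel-lines model y = b0 + b1*nit + b2*x2 + b3*x3 + error.
import statsmodels.formula.api as smf

full = smf.ols("Yield ~ nit + x2 + x3", data=beets).fit()
print(full.params)   # should sit close to the Minitab estimates 84.99, 1.3088, -2.43, -2.35
```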
\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.
\(F=\dfrac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\)
\(F = \dfrac{\frac{16039.5}{3}}{\frac{1078.0}{30-4}} = \dfrac{5346.5}{41.46} = 128.95\)
Since the p-value for this F-statistic is reported as 0.000, we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that at least one of the slope parameters is not zero, i.e., a regression model containing at least one of these predictors is useful in predicting sugar beet yield.
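A quick check of this overall F-statistic, using only the sums of squares quoted above (scipy supplies the p-value); with the statsmodels fit from the earlier sketch, full.fvalue and full.f_pvalue report the same test:

```python
# Overall F-test that beta1 = beta2 = beta3 = 0, from the ANOVA sums of squares.
from scipy import stats

f_overall = (16039.5 / 3) / (1078.0 / (30 - 4))
print(round(f_overall, 2))               # ~128.95
print(stats.f.sf(f_overall, 3, 26))      # upper-tail p-value, effectively 0
```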
\(H_0 \colon \beta_1= 0\) against the alternative \(H_A \colon \beta_1 \ne 0\)
t-statistic = 19.60, p-value = 0.000, so we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that the slope parameter for \(x_{1}\) = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).
\(F=\dfrac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\)
Use the Minitab output to calculate the value of this F-statistic. Does the value you obtain equal \(t^{2}\), the square of the t-statistic, as we might expect?
\(F = \dfrac{\frac{15934.5}{1}}{\frac{1078.0}{30-4}} = \dfrac{15934.5}{41.46} = 384.32\), which matches \(19.60^{2}\) up to rounding.
Because \(t^{2}\) will equal the partial F-statistic whenever you test whether one slope parameter is 0, it makes sense to just use the t-statistic and p-value that Minitab displays by default. But note that we've just learned something new about the meaning of the t-test in the multiple regression setting: it tests for the ("marginal") significance of the \(x_{1}\) predictor after \(x_{2}\) and \(x_{3}\) have already been taken into account.
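The same equivalence can be checked in the Python sketch by running the partial F-test for nit as a nested-model comparison and squaring the t-statistic (again assuming the beets DataFrame and column names introduced above):

```python
# Partial F-test for nit, given x2 and x3, as a comparison of nested models.
import statsmodels.api as sm
import statsmodels.formula.api as smf

full = smf.ols("Yield ~ nit + x2 + x3", data=beets).fit()
reduced = smf.ols("Yield ~ x2 + x3", data=beets).fit()    # drop nit

print(sm.stats.anova_lm(reduced, full))   # F for adding nit, with its p-value
print(full.tvalues["nit"] ** 2)           # should match that F (about 384)
```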
\(H_0 \colon \beta_2=\beta_3= 0\) against the alternative \(H_A \colon \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).
\(F=\dfrac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\)
\(F = \dfrac{\frac{10.4+27.5}{2}}{\frac{1078.0}{30-4}} = \dfrac{18.95}{41.46} = 0.46\).
The p-value is the upper-tail probability from an F distribution with 2 numerator and 26 denominator degrees of freedom. Minitab's cumulative distribution function gives:
x | P(X ≤ x) |
---|---|
0.46 | 0.363677 |
p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_{0}\) in favor of \(H_{A}\) and conclude that we cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.
Note that the sequential mean square due to regression, MSR(\(X_{2}\),\(X_{3}\)|\(X_{1}\)), is obtained by dividing the sequential sum of squares by its degrees of freedom (2, in this case, since two additional predictors \(X_{2}\) and \(X_{3}\) are considered). Use the Minitab output to calculate the value of this F-statistic, and use Minitab to get the associated p-value. Answer the researchers' question at the \(\alpha = 0.05\) level.
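The hand calculation above can be reproduced directly from the quoted sums of squares; equivalently, the same test comes from comparing the model with nit alone to the full model (e.g. anova_lm(smf.ols("Yield ~ nit", data=beets).fit(), full) in the sketch above). A self-contained check:

```python
# Partial F-test for the treatment indicators: H0: beta2 = beta3 = 0, given nitrogen.
from scipy import stats

mse_full = 1078.0 / (30 - 4)                      # MSE(X1, X2, X3) from the text
f_treat = ((10.4 + 27.5) / 2) / mse_full          # sequential SS for x2 and x3, given x1
p_treat = stats.f.sf(f_treat, 2, 26)              # upper tail of F with 2 and 26 df
print(round(f_treat, 2), round(p_treat, 3))       # ~0.46 and ~0.64 (0.636 above, up to rounding of F)
```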
Multiple regression analysis extends simple linear regression to allow for multiple independent (predictor) variables. The model now includes two or more predictor variables but still contains a single dependent (criterion) variable.
Assumptions
Running Multiple Linear Regression in SPSS
Interpreting Output
Reporting Results in APA Style
A multiple regression was run to predict job satisfaction from salary, years of experience, and perceived appreciation. This resulted in a significant model, F(3, 72) = 16.2132, p < .01, \(R^{2}\) = .638. The individual predictors were examined further and indicated that salary (t = 9.21, p < .01) and perceived appreciation (t = 15.329, p < .001) were significant predictors, but years of experience was not (t = 1.16, p = .135).