Multiple Regression Analysis Example with Conceptual Framework

Multiple regression analysis is a fairly common tool in statistics. Many graduate students find it too complicated to understand, but it is not that difficult to carry out, especially now that computers are everyday household items. You can quickly analyze the relationship among more than two variables in your research using multiple regression analysis.

How is multiple regression analysis done? This article explains this handy statistical test for dealing with many variables, then walks through an example of research using multiple regression analysis to show how it is conducted.


Statistical software applications used in computing multiple regression analysis.

Multiple regression analysis is a powerful statistical test used to find the relationship between a given dependent variable and a set of independent variables.

Two decades ago, it would have been nearly impossible to do the calculations with the simple calculators that smartphones have since made obsolete.

However, a standard spreadsheet application like Microsoft Excel can help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you cannot do this without first activating the statistical tools that ship with MS Excel.

Activating MS Excel

Multiple Regression Analysis Example

I will illustrate the use of multiple regression analysis by citing the actual research activity that my graduate students undertook two years ago.

Review of Literature on Internet Use and Its Effect on Children

Upon reviewing the literature, the graduate students discovered that very few studies had been conducted on the subject. Studies on problems associated with internet use are still in their infancy, as the Internet has only recently begun to influence everyone’s life.

Hence, with my guidance, the group of six graduate students comprising school administrators, heads of elementary and high schools, and faculty members proceeded with the study.

Given the need to use a computer to analyze data on multiple variables, a principal who was nearing retirement was “forced” to buy a laptop, as she had none. Even so, she was very open-minded and performed the class activities that required data analysis with much enthusiasm.

The Research on High School Students’ Use of the Internet

They correlated the time high school students spent online with their profile. The students’ profile comprised more than two independent variables, hence the term “multiple.” The independent variables are age, gender, relationship with the mother, and relationship with the father.

“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”

Although many studies have identified factors that influence the use of the internet, it is standard practice to include the respondents’ profile among the set of predictor or independent variables. Hence, the standard variables age and gender are included in the multiple regression analysis.

Findings of the Research Using Multiple Regression Analysis

The number of hours the children spent online relates significantly to the number of hours their mothers spent interacting with them. The relationship means that the more hours a mother spends with her child to establish a closer emotional bond, the fewer hours her child spends using the internet.

While this example of research using multiple regression analysis yielded a significant finding, the mother-child bond accounts for only a small percentage of the variance in the total hours the child spends online. This observation means that other factors need to be addressed to curb children’s long waking hours online and their neglect of serious study.

But establishing a close bond between mother and child is a good start. Undertaking more investigations along this research concern will help strengthen the findings of this study.

The identification of significant predictors can help determine the correct intervention to resolve the problem. Using multiple regression approaches prevents unnecessary costs for remedies that do not address an issue or a question.



Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

  • Assumptions of multiple linear regression
  • How to perform a multiple linear regression
  • Interpreting the results
  • Presenting the results
  • Other interesting articles
  • Frequently asked questions about multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon

  • y = the predicted value of the dependent variable
  • \beta_0 = the y-intercept
  • \beta_1 X_1 = the regression coefficient (\beta_1) of the first independent variable (X_1)
  • … = do the same for however many independent variables you are testing
  • \beta_n X_n = the regression coefficient of the last independent variable
  • \epsilon = model error (how much variation there is in the estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.
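The worked example below uses R; purely as an illustration of the same computations, a rough Python sketch with statsmodels might look like this (the file name and column names are assumptions, not the guide’s own code):

    import pandas as pd
    import statsmodels.formula.api as smf

    heart_data = pd.read_csv("heart.data.csv")  # file name is an assumption

    model = smf.ols("heart_disease ~ biking + smoking", data=heart_data).fit()
    print(model.params)                  # regression coefficients
    print(model.tvalues)                 # t statistic for each coefficient
    print(model.pvalues)                 # associated p values
    print(model.fvalue, model.f_pvalue)  # overall model F test and its p value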

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and fit the model with R’s lm() function.

The lm() call takes the data set heart.data and estimates the effect that the independent variables biking and smoking have on the dependent variable heart disease.

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function.

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these are (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

The most important things to note in this output table are the next two tables – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.
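A sketch of how such predicted values might be generated, continuing the Python sketch above (it assumes the fitted model and data frame from that sketch):

    import numpy as np
    import pandas as pd

    # `model` is the fitted statsmodels result and `heart_data` the data frame
    # with columns biking and smoking (assumed names).
    biking_range = np.linspace(heart_data["biking"].min(),
                               heart_data["biking"].max(), 100)

    predicted = {}
    for label, smoke in [("min", heart_data["smoking"].min()),
                         ("mean", heart_data["smoking"].mean()),
                         ("max", heart_data["smoking"].max())]:
        grid = pd.DataFrame({"biking": biking_range, "smoking": smoke})
        predicted[label] = model.predict(grid)
    # Each series in `predicted` can then be plotted against biking_range.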


If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
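In symbols, with ŷᵢ the model’s prediction for the ith observation, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², and the fitted coefficients are the values of the regression coefficients that minimize this quantity.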

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the “Cite this Scribbr article” button to automatically add the citation to our free Citation Generator.

Bevans, R. (2023, June 22). Multiple Linear Regression | A Quick Guide (Examples). Scribbr. Retrieved June 14, 2024, from https://www.scribbr.com/statistics/multiple-linear-regression/



Multiple linear regression


Fig. 11 Multiple linear regression #

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0,\dots,\hat\beta_p\) are chosen to minimize the residual sum of squares (RSS), given below.

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write the model as \(Y = X\beta + \varepsilon\) and the RSS as \(\text{RSS}(\beta)=\|Y-X\beta\|_2^2\).

Multiple linear regression answers several questions #

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\) ?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\) , and how accurate is this prediction?

The estimates \(\hat\beta\) #

Our goal again is to minimize the RSS:

\[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ &= \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]

One can show that this is minimized by the vector \(\hat\beta\):

\[ \hat\beta = ({X}^T{X})^{-1}{X}^T{y}. \]

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
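As a quick numerical illustration of this closed form (a Python sketch on simulated data, not the lecture’s own example):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column of ones for the intercept
    beta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ beta_true + rng.normal(scale=0.1, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
    rss = np.sum((y - X @ beta_hat) ** 2)         # minimized RSS
    print(beta_hat, rss)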

Which variables are important? #

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\) .

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\)-statistic is defined by:

\[ F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}. \]

Under the null hypothesis (of our model), this has an \(F\) -distribution.

Example: If \(q=p\), we test whether any of the variables is important.

\[ \text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 \]

anova: 2 × 6

  Res.Df   RSS        Df   Sum of Sq   F          Pr(>F)
  494      11336.29   NA   NA          NA         NA
  492      11078.78   2    257.5076    5.717853   0.003509036

The \(t\) -statistic associated to the \(i\) th predictor is the square root of the \(F\) -statistic for the null hypothesis which sets only \(\beta_i=0\) .

A low \(p\) -value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the \(t\) -tests will have low p-values even when the model has no explanatory power.

How many variables are important? #

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
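As an illustration of the forward strategy, a small Python sketch that greedily adds the predictor reducing the RSS the most (the arrays X and y are assumptions of this illustration):

    import numpy as np

    def rss_of(X_subset, y):
        """RSS of an OLS fit (with intercept) on the chosen columns."""
        design = np.column_stack([np.ones(len(y)), X_subset])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.sum((y - design @ beta) ** 2)

    def forward_selection(X, y, k):
        """Greedily add the predictor that most reduces the RSS, up to k variables."""
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(k):
            best = min(remaining, key=lambda j: rss_of(X[:, selected + [j]], y))
            selected.append(best)
            remaining.remove(best)
        return selected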

Choosing one model in the range produced is a form of tuning . This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…

How good are the predictions? #

The function predict in R outputs predictions and confidence intervals from a linear model:

A 3 × 3 matrix (confidence intervals):

  fit         lwr         upr
  9.409426    8.722696    10.09616
  14.163090   13.708423   14.61776
  18.916754   18.206189   19.62732

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.

A 3 × 3 matrix (prediction intervals):

  fit         lwr         upr
  9.409426    2.946709    15.87214
  14.163090   7.720898    20.60528
  18.916754   12.451461   25.38205

These functions rely on our linear regression model

\[ Y = X\beta + \epsilon. \]
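A comparable sketch in Python with statsmodels, showing both interval types (data file and column names are assumptions):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("some_data.csv")             # assumption
    fit = smf.ols("y ~ x1 + x2", data=df).fit()   # assumed columns y, x1, x2

    new_points = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.5, 0.5]})
    pred = fit.get_prediction(new_points).summary_frame(alpha=0.05)
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])  # confidence interval for the mean
    print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])    # prediction interval for a new observation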

Dealing with categorical or qualitative predictors #

For each qualitative predictor, e.g. Region :

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be:

\[ Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \beta_\text{South} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon. \]

The parameter \(\beta_\text{South}\) is the relative effect on Balance (our \(Y\)) of being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.

Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
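As an illustration of the baseline coding described above, a small pandas sketch (the example values are made up):

    import pandas as pd

    df = pd.DataFrame({"Region": ["East", "South", "West", "East", "West"],
                       "Balance": [500, 420, 610, 480, 590]})

    # Make Region a categorical with "East" as the first (baseline) level, then
    # drop the first dummy: the remaining indicators are X_South and X_West.
    df["Region"] = pd.Categorical(df["Region"], categories=["East", "South", "West"])
    dummies = pd.get_dummies(df["Region"], prefix="X", drop_first=True)
    design = pd.concat([df[["Balance"]], dummies], axis=1)
    print(design.head())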

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit? #

To assess the fit, we focus on the residuals

\[ e = Y - \hat{Y} \]

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this:

\[ \text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}. \]

Fig. 12 Residuals #

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions:

Potential issues in linear regression #

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors #

Linear regression has an additive assumption:

\[ \mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon \]

i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include multiplicative variables in the model:

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:
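In R’s formula syntax, tv * radio expands to tv + radio + tv:radio; a comparable sketch with Python’s statsmodels formula interface (data file and column names are assumptions) would be:

    import pandas as pd
    import statsmodels.formula.api as smf

    ads = pd.read_csv("Advertising.csv")   # file and column names are assumptions

    # In the formula mini-language, tv * radio expands to tv + radio + tv:radio.
    interaction_fit = smf.ols("sales ~ tv * radio", data=ads).fit()
    print(interaction_fit.params)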

Non-linearities #

Fig. 13 A nonlinear fit might be better here. #

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…
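As a rough illustration (not the notes’ own code), a quadratic term can be added through the formula interface; the file name and column names below are assumptions:

    import pandas as pd
    import statsmodels.formula.api as smf

    auto = pd.read_csv("Auto.csv")   # file and column names are assumptions

    # I(...) makes horsepower ** 2 a literal quadratic term in the design matrix.
    quad_fit = smf.ols("mpg ~ horsepower + I(horsepower ** 2)", data=auto).fit()
    print(quad_fit.params)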

Fig. 14 Residuals for Auto data #

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms #

We assumed that the errors for each sample are independent:

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example : Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\) .

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors #

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity) #

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.

Outliers #

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions: #

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points #

Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .

This can be measured with the leverage statistic or self influence \(h_{ii}\), the \(i\)th diagonal entry of the hat matrix \(H = X(X^TX)^{-1}X^T\).

Studentized residuals #

The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\) .

The standard error of \(\hat \epsilon_i\) is \(\sigma \sqrt{1-h_{ii}}\) .

A studentized residual is \(\hat \epsilon_i\) divided by its standard error (with appropriate estimate of \(\sigma\) )

When the model is correct, it follows a Student-t distribution with \(n-p-2\) degrees of freedom.

Collinearity #

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable .

Consider the extreme case of using two identical predictors limit:

\[ \begin{aligned} \mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\ & = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon \end{aligned} \]

For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\) .

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multicollinear variables.

The Variance Inflation Factor (VIF) measures how predictable a given variable is from the other variables, a proxy for how necessary it is: \(\text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\).

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.
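As an illustration, a sketch of how VIFs might be computed with statsmodels (the helper name and the DataFrame of predictors are assumptions):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_table(X: pd.DataFrame) -> pd.Series:
        """VIF of each predictor, computed on a design matrix with an intercept."""
        design = sm.add_constant(X)
        vifs = [variance_inflation_factor(design.values, i)
                for i in range(1, design.shape[1])]  # skip the constant column
        return pd.Series(vifs, index=X.columns)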


Multiple linear regression: Theory and applications

Linear least-squares explained in detail and implemented from scratch in python.

Bruno Scalia C. F. Leite

Towards Data Science

Multiple linear regression is one of the most fundamental statistical models due to its simplicity and interpretability of results. For prediction purposes, linear models can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio, or sparse data (Hastie et al., 2009). In these models, as their name suggests, a predicted (or response) variable is described by a linear combination of predictors. The term “multiple” refers to the predictor variables.

Throughout this article, the underlying principles of the Ordinary Least-Squares (OLS) regression model will be described in detail, and a regressor will be implemented from scratch in Python. All the code used is available in this example notebook .

Linear regression is already available in many Python frameworks. Therefore, in practice, one does not need to implement it from scratch to estimate regression coefficients and make predictions. However, our goal here is to gain insight into how these models work and their assumptions to be more effective when tackling future projects. From the usual frameworks, I suggest checking OLS from statsmodels and LinearRegression from sklearn .

Let us dive in.

Linear least-squares

Before diving into equations, I would like to define some notation guidelines.

  • Matrices: uppercase italic bold.
  • Vectors: lowercase italic bold.
  • Scalars: regular italic.

A multiple linear regression model, or an OLS, can be described by the equation below.

In which yᵢ is the dependent variable (or response) of observation i, β₀ is the regression intercept, βⱼ are the coefficients associated with decision variables j, xᵢⱼ is the decision variable j of observation i, and ε is the residual term. In matrix notation, it can be described by:

In which β is a column vector of parameters.
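Written out in the notation just defined (a reconstruction of the equations the text describes, not the article’s own rendering), the model is yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + εᵢ for i = 1, …, n, or, stacking the observations, y = Xβ + ε.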

The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions (Hastie et al., 2009). When adopting a linear model, one should be aware of these assumptions to make correct inferences about the results and to perform necessary changes.

The residual terms ε are assumed to be normally and independently distributed with mean zero and constant variance σ ². Some model properties, such as confidence intervals of parameters and predictions, strongly rely on these assumptions about ε . Verifying them is, therefore, essential to obtain meaningful results.

The goal of the linear least-squares regression model is to find the values for β that minimize the sum of squared residuals (or squared errors), given by the equation below.

This is an optimization problem with an analytical solution. The formula is based on the gradient of each prediction with respect to the vector of parameters β , which corresponds to the vector of independent variables itself x ᵢ . Consider a matrix C , given by the equation below.

The least-squares estimate of β is given by:
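Assuming C denotes (XᵀX)⁻¹, which is consistent with its later use in the variance-covariance matrix of the parameters, the objective and its closed-form minimizer are SSE(β) = Σᵢ (yᵢ − xᵢᵀβ)² = ‖y − Xβ‖², C = (XᵀX)⁻¹, and β̂ = C Xᵀ y = (XᵀX)⁻¹ Xᵀ y.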

Let us, in the next steps, create a Python class, LinearRegression, to perform these estimations. But before, let us import some useful libraries and functions to use throughout this article.

The first step is to create an estimator class and a method to include a column of ones in the matrix of estimators if we want to consider an intercept β ₀.

Now, let us implement a fit method ( sklearn -like) to estimate β.

A method for predictions.

And a method for computing the R² metric.
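A minimal sketch of what such an estimator class might look like (the normal-equations fit and method names are illustrative assumptions, not the author’s exact code):

    import numpy as np

    class LinearRegression:
        def __init__(self, fit_intercept=True):
            self.fit_intercept = fit_intercept

        def _add_intercept(self, X):
            # Prepend a column of ones so that beta[0] plays the role of beta_0.
            X = np.asarray(X, dtype=float)
            if self.fit_intercept:
                X = np.column_stack([np.ones(X.shape[0]), X])
            return X

        def fit(self, X, y):
            X_ = self._add_intercept(X)
            y = np.asarray(y, dtype=float)
            # C = (X^T X)^{-1}; beta_hat = C X^T y (the least-squares estimate).
            self.C_ = np.linalg.inv(X_.T @ X_)
            self.beta_ = self.C_ @ X_.T @ y
            return self

        def predict(self, X):
            return self._add_intercept(X) @ self.beta_

        def r2_score(self, X, y):
            y = np.asarray(y, dtype=float)
            resid = y - self.predict(X)
            return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)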

Statistical significance of parameters

It is useful to test the statistical significance of parameters β to verify the relevance of a predictor. When doing so, we are able to remove poor predictors, avoid confounding effects, and improve model performance on new predictions. To do so, one should test the null hypothesis that the parameter β associated with a given predictor is zero. Let us compute the variance-covariance matrix V of the parameters β and their corresponding standard errors using the matrix C and the variance of residuals σ̂² .

In Python, we can just add the following lines to the fit method:

Then, we might get the t -value associated with the null hypothesis and its corresponding p -value. In Python, it can be done by using the t generator instance from scipy.stats (previously imported here as t_fun ).
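A sketch of these significance calculations, continuing the class sketch above (function and attribute names are assumptions):

    import numpy as np
    from scipy.stats import t as t_fun   # the text imports this generator as t_fun

    def add_significance(model, X, y):
        """Attach standard errors, t-values and p-values to the fitted sketch."""
        X_ = model._add_intercept(X)
        y = np.asarray(y, dtype=float)
        n, k = X_.shape                       # k parameters including the intercept
        resid = y - X_ @ model.beta_
        sigma2_hat = resid @ resid / (n - k)  # estimated residual variance
        V = sigma2_hat * model.C_             # variance-covariance matrix of beta_hat
        model.se_ = np.sqrt(np.diag(V))
        model.t_ = model.beta_ / model.se_
        model.p_ = 2.0 * t_fun.sf(np.abs(model.t_), df=n - k)  # two-sided p-values
        return model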

Now we have our tools ready to estimate regression coefficients and their statistical significance and to make predictions from new observations. Let us apply this framework in the next section.

Wire bond example

This is an example in which the goal is to predict the pull strength of a wire bond in a semiconductor manufacturing process based on wire length and die height. It was retrieved from Montgomery & Runger (2003). This is a small dataset, a situation in which linear models can be especially useful. I saved a copy of it in a .txt file in the same repository as the example notebook.

Let us first import the dataset.

And then define the matrix of independent variables X and the vector of the observed values of the predicted variable y .

I’ve created a few scatter plots to see how the predictors are related to the dependent variable.

Notice that a strong linear relationship exists between the predicted variable and the regressor wire length. Conversely, the linear relationship between die height and wire bond is not so evident in the pairwise visualization, although this might be attributed to the effects of the other predictor.

Next, let us create an instance of the LinearRegression class, fit it to the data, and verify its performance based on the R² metric.

Which returns the value of 0.9811.

It seems promising! Now, to verify the statistical significance of the parameters, let us run the following code:

Which returns:

Therefore, we have a model with great performance and statistical significance, which is likely to perform well on new observations as long as data distribution does not significantly change. Notice that, according to its p -value, we might consider dropping the intercept. Finally, let us verify the model assumptions.

  • Residuals are normally and independently distributed.
  • The mean of residuals is zero.
  • Residuals have constant variance.

Let us first verify the mean of residuals.

Which seems okay:

And now, let us use the Shapiro-Wilk test for normality.

Therefore, we can not reject the null hypothesis that our residuals come from a normal distribution.

Finally, let us plot the residuals versus the predicted variable and regressors to verify if they are independently distributed.
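A sketch of these three residual checks, assuming the fitted sketch above and illustrative wire-bond variable names:

    import matplotlib.pyplot as plt
    from scipy.stats import shapiro

    residuals = y - model.predict(X)          # assumes the fitted sketch above
    print("mean of residuals:", residuals.mean())

    stat, p_value = shapiro(residuals)        # H0: residuals come from a normal distribution
    print("Shapiro-Wilk p-value:", p_value)

    plt.scatter(model.predict(X), residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("predicted pull strength")
    plt.ylabel("residual")
    plt.show()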

The residuals are mostly well distributed, especially in the region of wire length below seven and target below 25. However, there might exist some nonlinearity between the target and wire length, as the residuals at intermediate values are biased towards negative values, whereas high values (target > 50) are biased towards positive values. Those interested in exploring the problem in more detail might try to create polynomial features and see if this bias is reduced. Moreover, the variance of the residuals does not appear to depend on the predicted values. Therefore, it is unlikely that we could improve model performance by weighting residuals.

In the next section, let us create a false predictor correlated to the first independent variable and verify if the statistical significance test can identify it.

Adding a false predictor

The false predictor will be equal to the wire length added to a random noise term following a normal distribution of mean zero and sigma one.

Notice that it is also linearly correlated to the response variable. In fact, even more correlated than the die height.

Let us repeat the process from the previous section to see the results.

It seems the scoring metric is still great (0.9828), but our results are not necessarily as meaningful as before. Let us verify the confidence intervals.

And it seems we found the impostor…

Therefore, our framework effectively verifies that the false predictor does not provide additional information to the model, given that we still have the original values of wire length, even though the false predictor has a strong linear correlation to the predicted variable. Conversely, the die height, which has a weaker correlation, contributes to the model with statistical significance.

In such situations, removing unnecessary features from the model is highly recommended to improve its interpretability and generality. In the example notebook , I also present a strategy for recursive feature elimination based on p -values.

Further reading

As one with an Engineering background, the first reference I should recommend in the area is the book Applied Statistics and Probability for Engineers by Montgomery & Runger (2003). There, one might find a more detailed description of the fundamentals herein presented and other relevant regression aspects such as multicollinearity, confidence intervals on the mean response, and feature selection.

Those more interested in Machine Learning can refer to The Elements of Statistical Learning by Hastie et al. (2009). In particular, I find it very interesting to see how the authors explain the bias-variance trade-off comparing nearest neighbors to linear models.

Nonlinear regression is also fascinating when one aims to estimate parameters with some fundamental meaning to describe nonlinear phenomena. The book Nonlinear Regression Analysis and Its Applications by Bates & Watts (1988) is a great reference. The subject is also presented in Chapter 3 of the book Generalized Linear Models by Myers et al. (2010).

Lecture notes by professor Cosma Shalizi are available on this link . Currently, the draft is under the title of Advanced Data Analysis from an Elementary Point of View . There, one can find interesting subjects such as Weighting and Variance, Causal Inference, and Dependent Data.

I appreciate the Python package statsmodels . After applying OLS (as we performed in this article), one might be interested in trying WLS for problems with an uneven variance of residuals and using VIF for detecting feature multicollinearity.

Conclusions

In this article, the main principles of multiple linear regression were presented, followed by implementation from scratch in Python. The framework was applied to a simple example, in which the statistical significance of parameters was verified besides the main assumptions about residuals in linear least-squares problems. The complete code and additional examples are available in this link .

Bates, D. M. & Watts, D. G., 1988. Nonlinear Regression Analysis and Its Applications. Wiley.

Hastie, T., Tibshirani, R. & Friedman, J. H., 2009. The Elements of Statistical Learning: Data mining, Inference, and Prediction. 2nd ed. New York: Springer.

Montgomery, D. C. & Runger, G., 2003. Applied Statistics and Probability for Engineers. 3rd ed. John Wiley and Sons.

Myers, R. H., Montgomery, D. C., Vining, G. G. & Robinson, T. J., 2012. Generalized linear models: with applications in engineering and the sciences. 2nd ed. Hoboken: John Wiley & Sons.

Shalizi, C., 2021. Advanced Data Analysis from an Elementary Point of View. Cambridge University Press.


Written by Bruno Scalia C. F. Leite

Chemical Engineer, Researcher, Optimization Enthusiast, and Data Scientist passionate about describing phenomena using mathematical models.



Introduction to Multivariate Regression Analysis

Statistics are used in medicine for data description and inference. Inferential statistics are used to answer questions about the data, to test hypotheses (formulating the alternative or null hypotheses), to generate a measure of effect (typically a ratio of rates or risks), to describe associations (correlations) or to model relationships (regression) within the data, and in many other functions. Usually, point estimates are the measures of associations or of the magnitude of effects. Confounding, measurement errors, selection bias and random errors make it unlikely that the point estimates equal the true ones. In the estimation process, the random error is not avoidable. One way to account for it is to compute p-values for a range of possible parameter values (including the null). The range of values for which the p-value exceeds a specified alpha level (typically 0.05) is called the confidence interval. An interval estimation procedure will, in 95% of repetitions (identical studies in all respects except for random error), produce limits that contain the true parameters. It has been argued that the question of whether the pair of limits produced from a study contains the true parameter cannot be answered by the ordinary (frequentist) theory of confidence intervals 1. Frequentist approaches derive estimates by using probabilities of data (either p-values or likelihoods) as measures of compatibility between data and hypotheses, or as measures of the relative support that data provide for hypotheses. Another approach, the Bayesian, uses the data to improve existing (prior) estimates in light of new data. Proper use of any approach requires careful interpretation of statistics 1, 2.

The goal in any data analysis is to extract accurate estimates from the raw information. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). An option to answer this question is to employ regression analysis in order to model the relationship. There are various types of regression analysis. The type of regression model depends on the type of the distribution of Y: if it is continuous and approximately normal, we use a linear regression model; if dichotomous, we use logistic regression; if Poisson or multinomial, we use log-linear analysis; if it is time-to-event data in the presence of censored cases (survival-type), we use Cox regression as the method for modeling. By modeling we try to predict the outcome (Y) based on values of a set of predictor variables (Xi). These methods allow us to assess the impact of multiple variables (covariates and factors) in the same model 3, 4.

In this article we focus on linear regression. Linear regression is the procedure that estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable, which should be quantitative. Logistic regression is similar to linear regression but is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model.

Linear equation

In most statistical packages, a curve estimation procedure produces curve estimation regression statistics and related plots for many different models (linear, logarithmic, inverse, quadratic, cubic, power, S-curve, logistic, exponential, etc.). It is essential to plot the data in order to determine which model to use for each dependent variable. If the variables appear to be related linearly, a simple linear regression model can be used, but if the variables are not linearly related, data transformation might help. If the transformation does not help, then a more complicated model may be needed. It is strongly advised to view a scatterplot of your data early on; if the plot resembles a mathematical function you recognize, fit the data to that type of model. For example, if the data resemble an exponential function, an exponential model is to be used. Alternatively, if it is not obvious which model best fits the data, an option is to try several models and select among them. It is strongly recommended to screen the data graphically (e.g. by a scatterplot) in order to determine how the independent and dependent variables are related (linearly, exponentially, etc.) 4-6.

The most appropriate model could be a straight line, a higher-degree polynomial, a logarithmic or an exponential curve. The strategies to find an appropriate model include the forward method, in which we start by assuming a very simple model, i.e. a straight line (Y = a + bX or Y = b0 + b1X). Then we find the best estimate of the assumed model. If this model does not fit the data satisfactorily, we assume a more complicated model, e.g. a 2nd-degree polynomial (Y = a + bX + cX²), and so on. In a backward method we assume a complicated model, e.g. a high-degree polynomial, we fit the model, and we try to simplify it. We might also use a model suggested by theory or experience. Often a straight-line relationship fits the data satisfactorily, and this is the case of simple linear regression. The simplest case of linear regression analysis is that with one predictor variable 6, 7.

Linear regression equation

The purpose of regression is to predict Y on the basis of X, or to describe how Y depends on X (regression line or curve).

The Xi (X1, X2, …, Xk) are defined as "predictor", "explanatory" or "independent" variables, while Y is defined as the "dependent", "response" or "outcome" variable.

Assuming a linear relation in the population, the mean of Y for a given X equals α + βX, i.e. the "population regression line".

If Y = a + bX is the estimated line, then Ŷi = a + bXi is called the fitted (or predicted) value, and Yi − Ŷi is called the residual.

The estimated regression line is determined in such a way that Σ(residuals)² is minimal, i.e. the standard deviation of the residuals is minimized (the residuals are on average zero). This is called the "least squares" method. In the equation

Ŷ = a + bX

b is the slope (the average increase of outcome per unit increase of predictor)

a is the intercept (often has no direct practical meaning)

A more detailed regression equation, giving higher precision for the estimates a and b, can also be written.

Further inference about the regression line can be made by estimating a confidence interval (the 95% CI for the slope b). The calculation is based on the standard error of b:

se(b) = σres / √(Σ(Xi − X̄)²)

so, the 95% CI for β is b ± t0.975 × se(b) [t-distribution with df = n − 2]

and the test for H0: β=0, is t = b / se(b) [p-value derived from t-distr. with df = n-2].

If the p value lies above 0.05 then the null hypothesis is not rejected which means that a straight line model in X does not help predicting Y. There is the possibility that the straight line model holds (slope = 0) or there is a curved relation with zero linear component. On the other hand, if the null hypothesis is rejected either the straight line model holds or in a curved relationship the straight line model helps, but is not the best model. Of course there is the possibility for a type II or type I error in the first and second option, respectively. The standard deviation of residual (σ res ) is estimated by

σres = √(Σ(Yi − Ŷi)² / (n − 2))

The standard deviation of residual (σ res ) characterizes the variability around the regression line i.e. the smaller the σ res , the better the fit. It has a number of degrees of freedom. This is the number to divide by in order to have an unbiased estimate of the variance. In this case df = n-2, because two parameters, α and β, are estimated 7 .

Multiple linear regression analysis

As an example, in a sample of 50 individuals we measured: Y = toluene personal exposure concentration (a widespread aromatic hydrocarbon); X1 = hours spent outdoors; X2 = wind speed (m/sec); X3 = toluene home levels. Y is the continuous response variable ("dependent") while X1, X2, …, Xp are the predictor variables ("independent") [7]. Usually the questions of interest are how to predict Y on the basis of the X's, and what is the "independent" influence of wind speed, i.e. corrected for home levels and other related variables. These questions can in principle be answered by multiple linear regression analysis.

In the multiple linear regression model, Y has normal distribution with mean

β0 + β1X1 + β2X2 + … + βpXp

The model parameters β0, β1, …, βp and σ must be estimated from the data.

β0 = intercept

β1, …, βp = regression coefficients

σ = σres = residual standard deviation

Interpretation of regression coefficients

In the equation Y = β0 + β1X1 + … + βpXp,

βi equals the mean increase in Y per unit increase in Xi, while the other X's are kept fixed. In other words, βi is the influence of Xi corrected (adjusted) for the other X's. The estimation method follows the least squares criterion.

If b0, b1, …, bp are the estimates of β0, β1, …, βp, then the "fitted" value of Y is

Yfit = b0 + b1X1 + b2X2 + … + bpXp

In our example, the statistical packages give the following estimates or regression coefficients (bi) and standard errors (se) for toluene personal exposure levels.

[Table of estimated regression coefficients (bi) and standard errors (se), shown as an image in the original.]

Then the regression equation for toluene personal exposure levels would be Tpers,fit = b0 + b1 × (time outdoors) + b2 × (home levels) + b3 × (wind speed), with the estimates taken from that table.

The estimated coefficient for time spent outdoors (0.582) means that the estimated mean increase in toluene personal levels is 0.582 µg/m³ if time spent outdoors increases by 1 hour, while home levels and wind speed remain constant. More precisely, one could say that individuals differing by one hour in the time spent outdoors, but having the same values on the other predictors, will have a mean difference in toluene exposure levels equal to 0.582 µg/m³ 8.

Be aware that this interpretation does not imply any causal relation.

Confidence interval (CI) and test for regression coefficients

The 95% CI for βi is given by bi ± t0.975 × se(bi), with df = n − 1 − p (df: degrees of freedom).

In our example, that means that the 95% CI for the coefficient of time spent outdoors is −0.19 to 0.49.

If, for example, we test H0: βhumidity = 0 and find P = 0.40, which is not significant, we assume that the association between toluene personal exposure and humidity could be explained by the correlation between humidity and wind speed 8.

In order to estimate the standard deviation of the residuals (Y − Yfit), i.e. the estimated standard deviation of a given set of variable values in a population sample, we have to estimate σ:

σ̂ = σres = √(Σ(Yi − Yfit,i)² / (n − p − 1))

The number of degrees of freedom is df = n − (p + 1), since p + 1 parameters are estimated.

The ANOVA table gives the total variability in Y which can be partitioned in a part due to regression and a part due to residual variation:

Σ(Yi − Ȳ)² = Σ(Yfit,i − Ȳ)² + Σ(Yi − Yfit,i)², i.e. SStotal = SSregression + SSresidual

With degrees of freedom (n − 1) = p + (n − p − 1)

In statistical packages the ANOVA table in which the partition is given usually has the following format [6]:

SS: "sums of squares"; df: Degrees of freedom; MS: "mean squares" (SS/dfs); F: F statistics (see below)

As a measure of the strength of the linear relation one can use R. R is called the multiple correlation coefficient between Y, the predictors (X1, …, Xp) and Yfit, and R square is the proportion of total variation explained by the regression (R² = SSreg / SStot).

Test on overall or reduced model

In our example, Tpers = β0 + β1 × (time outdoors) + β2 × (Thome) + β3 × (wind speed) + residual.

The null hypothesis (H0) is that there is no regression overall, i.e. β1 = β2 = … = βp = 0.

The test is based on the proportion of the SS explained by the regression relative to the residual SS. The test statistic (F = MSreg / MSres) has an F-distribution with df1 = p and df2 = n − p − 1 (F-distribution table). In our example F = 5.49 (P < 0.01).

If now we want to test the hypothesis H0: β1 = β2 = β5 = 0 (k = 3):

In general, k of the p regression coefficients are set to zero under H0. The model that is valid if H0 is true is called the "reduced model". The idea is to compare the explained variability of the model at hand with that of the reduced model.

The test statistic

F = [(SSres(reduced) − SSres(full)) / k] / [SSres(full) / (n − p − 1)]

follows an F-distribution with df1 = k and df2 = n − p − 1.

If one or two variables are left out and we calculate SSreg (the statistical package does this) and find that the p-value of the F statistic lies between 0.05 and 0.10, that means that there is some evidence, although not strong, that these variables together, independently of the others, contribute to the prediction of the outcome.

Assumptions

If a linear model is used, the following assumptions should be met. For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and the independent variables should be linear, and all observations should be independent. So the assumptions are: independence, linearity, normality and homoscedasticity. In other words, the residuals of a good model should be normally and randomly distributed, i.e. the unknown error does not depend on X ("homoscedasticity") 2, 4, 6, 9.

Checking for violations of model assumptions

To check model assumptions we used residual analysis. There are several kinds of residuals; the most commonly used are the standardized residuals (ZRESID) and the studentized residuals (SRESID) [6]. If the model is correct, the residuals should have a normal distribution with mean zero and constant sd (i.e. not depending on X). In order to check this we can plot the residuals against X. If the variation alters with increasing X, then there is a violation of homoscedasticity. We can also use the Durbin-Watson test for serial correlation of the residuals and casewise diagnostics for the cases meeting the selection criterion (outliers above n standard deviations). The residuals should be independent and normally distributed, with mean zero and constant standard deviation (homogeneity of variances) 4, 6.

To discover deviations from linearity and homogeneity of variances we can plot the residuals against each predictor or against the predicted values. Alternatively, by using a PARTIAL plot we can assess the linearity of a predictor variable. The partial plot for a predictor Xi is a plot of the residuals of Y regressed on the other X's against the residuals of Xi regressed on the other X's. The plot should be linear. To check the normality of the residuals we can use a histogram (with normal curve) or a normal probability plot 6, 7.

The goodness-of-fit of the model is assessed by studying the behavior of the residuals, looking for "special observations / individuals" like outliers, observations with high "leverage" and influential points. Observations deserving extra attention are outliers i.e. observations with unusually large residual; high leverage points: unusual x - pattern, i.e. outliers in predictor space; influential points: individuals with high influence on estimate or standard error of one or more β's. An observation could be all three. It is recommended to inspect individuals with large residual, for outliers; to use distances for high leverage points i.e. measures to identify cases with unusual combinations of values for the independent variables and cases that may have a large impact on the regression model. For influential points use influence statistics i.e. the change in the regression coefficients (DfBeta(s)) and predicted values (DfFit) that results from the exclusion of a particular case. Overall measure for influence on all β's jointly is "Cook's distance" (COOK). Analogously for standard errors overall measure is COVRATIO 6 .
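The article describes these diagnostics in terms of a standard statistical package; purely as an illustration, the same quantities can be obtained in Python with statsmodels (file and column names below are assumptions):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import OLSInfluence
    from statsmodels.stats.stattools import durbin_watson

    df = pd.read_csv("toluene.csv")                       # file and columns are assumptions
    X = sm.add_constant(df[["time_outdoors", "home_levels", "wind_speed"]])
    fit = sm.OLS(df["toluene_personal"], X).fit()

    influence = OLSInfluence(fit)
    leverage = influence.hat_matrix_diag                  # high-leverage points
    student_resid = influence.resid_studentized_external  # studentized residuals
    cooks_d, _ = influence.cooks_distance                 # overall influence on the betas
    dfbetas = influence.dfbetas                           # influence on each coefficient
    dw = durbin_watson(fit.resid)                         # serial correlation of the residuals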

Deviations from model assumptions

We can use some tips to correct deviations from the model assumptions. In case of curvilinearity in one or more plots we could add quadratic term(s). In case of non-homogeneity of the residual sd, we can try some transformation: log Y if Sres is proportional to predicted Y; square root of Y if the Y distribution is Poisson-like; 1/Y if Sres² is proportional to predicted Y; Y² if Sres² decreases with Y. If linearity and homogeneity hold, then non-normality does not matter if the sample size is big enough (n ≥ 50-100). If linearity but not homogeneity holds, then the estimates of the β's are correct, but not the standard errors. They can be corrected by computing "robust" se's (sandwich, Huber's estimate) 4, 6, 9.

Selection methods for Linear Regression modeling

There are various selection methods for linear regression modeling in order to specify how independent variables are entered into the analysis. By using different methods, a variety of regression models from the same set of variables could be constructed. Forward variable selection enters the variables in the block one at a time based on entry criteria. Backward variable elimination enters all of the variables in the block in a single step and then removes them one at a time based on removal criteria. Stepwise variable entry and removal examines the variables in the block at each step for entry or removal. All variables must pass the tolerance criterion to be entered in the equation, regardless of the entry method specified. A variable is not entered if it would cause the tolerance of another variable already in the model to drop below the tolerance criterion 6. During model fitting, the variables entered into and removed from the model and various goodness-of-fit statistics are displayed, such as R², R² change, the standard error of the estimate, and an analysis-of-variance table.

Relative issues

Binary logistic regression models can be fitted using either the logistic regression procedure or the multinomial logistic regression procedure. An important theoretical distinction is that the logistic regression procedure produces all statistics and tests using data at the level of individual cases, while the multinomial logistic regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors and bases its statistics and tests on these subpopulations. If all predictors are categorical, or any continuous predictors take on only a limited number of values, the multinomial procedure is preferred. As previously mentioned, use the Scatterplot procedure to screen data for multicollinearity. As with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors. If all of your predictor variables are categorical, you can also use the loglinear procedure.

In order to explore the correlation between variables, the Pearson or Spearman correlation for a pair of variables r(Xi, Xj) is commonly used. For each pair of variables (Xi, Xj), Pearson's correlation coefficient (r) can be computed; r(Xi; Xj) is a measure of the linear association between two (ideally normally distributed) variables, and R² is the proportion of the total variation of the one explained by the other (with r = b · Sx/Sy, identical with regression). Each correlation coefficient measures the association between two variables without taking other variables into account, but there are several useful correlation concepts involving more variables. The partial correlation coefficient between Xi and Xj, adjusted for other X's, e.g. r(X1; X2 / X3), can be viewed as an adjustment of the simple correlation taking into account the effect of a control variable: r(X; Y / Z) is the correlation between X and Y controlled for Z. The multiple correlation coefficient between one variable and several others, e.g. r(X1; X2, X3, X4), is a measure of the association between one variable and a set of other variables, r(Y; X1, X2, …, Xk). The multiple correlation coefficient between Y and X1, X2, …, Xk is defined as the simple Pearson correlation coefficient r(Y; Yfit) between Y and its fitted value in the regression model Y = β0 + β1X1 + … + βkXk + residual. The square of r(Y; X1, …, Xk) is interpreted as the proportion of variability in Y that can be explained by X1, …, Xk. The null hypothesis [H0: ρ(Y; X1, …, Xk) = 0] is tested with the F-test for overall regression, as in the multivariate regression model (see above) 6,7. The multiple-partial correlation coefficient between one X and several other X's, adjusted for some other X's, e.g. r(X1; X2, X3, X4 / X5, X6), equals the relative increase in the percentage of explained variability in Y obtained by adding X1, …, Xk to a model already containing Z1, …, Zp as predictors 6,7.
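A small base-R sketch of these quantities on hypothetical data: the multiple correlation coefficient is simply the correlation between Y and its fitted values, and a partial correlation can be computed by correlating the residuals obtained after regressing each variable on the control variable.

```r
# Multiple and partial correlation in base R -- a sketch on hypothetical data
set.seed(5)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), z = rnorm(100))

# Multiple correlation r(Y; X1, X2): correlation between Y and its fitted value
fit <- lm(y ~ x1 + x2, data = dat)
r_multiple <- cor(dat$y, fitted(fit))
r_multiple^2              # proportion of variability in Y explained by X1 and X2
summary(fit)$r.squared    # identical value taken from the regression output

# Partial correlation r(Y; X1 / Z): correlate the residuals after removing the effect of Z
cor(resid(lm(y ~ z, data = dat)), resid(lm(x1 ~ z, data = dat)))
```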

Other interesting cases of multiple linear regression analysis include the comparison of two group means. Suppose, for example, that we wish to answer the question of whether mean HEIGHT differs between men and women. We can use the simple linear regression model:

HEIGHT = β0 + β1 SEX + residual, where SEX is a dummy variable (e.g. 0 for women, 1 for men).

Testing β1 = 0 is equivalent to testing mean HEIGHT of men = mean HEIGHT of women by means of Student's t-test.

The linear regression model assumes a normal distribution of HEIGHT in both groups, with equal standard deviations; this is exactly the model of the two-sample t-test. In the case of comparison of several group means, we wish to answer the question of whether mean HEIGHT differs between different SES classes.

SES (socioeconomic status): 1 (low), 2 (middle), 3 (high)

We can use the following linear regression model:

HEIGHT = β0 + β1 Z1 + β2 Z2 + residual, where Z1 = 1 for the low SES class (0 otherwise) and Z2 = 1 for the middle SES class (0 otherwise), so that the high class is the reference category.

Then β1 and β2 are interpreted as:

β1 = difference in mean HEIGHT between the low and the high class

β2 = difference in mean HEIGHT between the middle and the high class

Testing β1 = β2 = 0 is equivalent to the one-way analysis of variance (ANOVA) F-test. The statistical model in both cases is in fact the same 4,6,7,9.
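Both equivalences are easy to verify; the sketch below uses simulated HEIGHT data (all values are made up):

```r
# Dummy-variable regression vs. t-test and one-way ANOVA -- a sketch on simulated data
set.seed(6)
height <- c(rnorm(60, 178, 7), rnorm(60, 165, 7))
sex    <- factor(rep(c("men", "women"), each = 60))
ses    <- factor(sample(c("low", "middle", "high"), 120, TRUE))

# Two groups: the test of beta1 = 0 reproduces the equal-variance two-sample t-test
summary(lm(height ~ sex))
t.test(height ~ sex, var.equal = TRUE)

# Three groups: the test of beta1 = beta2 = 0 reproduces the one-way ANOVA F-test
anova(lm(height ~ ses))
summary(aov(height ~ ses))
```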

Analysis of covariance (ANCOVA)

If we wish to compare a continuous variable Y (e.g. HEIGHT) between groups (e.g. men and women), corrected (adjusted or controlled) for one or more covariables X (confounders, e.g. X = age or weight), then the question is formulated as: are the mean HEIGHTs of men and women different if men and women of equal weight are compared? Be aware that this question is different from asking whether there is a difference between the mean HEIGHTs of men and women, and the answers can be quite different: the corrected difference between men and women can be opposite in sign, larger, or smaller than the crude difference. In order to estimate the corrected difference, the following multiple regression model is used:

Y = β0 + β1 Z + β2 X + residual

where Y: response variable (for example HEIGHT); Z: grouping variable (for example Z = 0 for men and Z = 1 for women); X: covariable (confounder) (for example weight).

So, for men the regression line is y = β0 + β2X and for women it is y = (β0 + β1) + β2X.

This model assumes that the regression lines are parallel. β1 is then the vertical distance between the two lines, and can be interpreted as the difference in mean response Y between the groups, corrected for X. If the regression lines are not parallel, the difference in mean Y depends on the value of X; this is called "interaction" or "effect modification".

A more complicated model, in which interaction is admitted, is:

Y = β0 + β1 Z + β2 X + β3 (Z × X) + residual

regression line for men: y = β0 + β2X

regression line for women: y = (β0 + β1) + (β2 + β3)X

The hypothesis of the absence of "effect modification" is tested by H0: β3 = 0.

As an example, suppose we are interested in the corrected-for-body-weight difference in HEIGHT between men and women in a population sample.

We check the model with interaction:

HEIGHT = β0 + β1 GENDER + β2 WEIGHT + β3 (GENDER × WEIGHT) + residual

By testing β3 = 0, a p-value much larger than 0.05 was calculated; we therefore assume that there is no interaction, i.e. that the regression lines are parallel. Analysis of covariance can also be used for three or more groups, for example if we ask for the difference in mean HEIGHT between people with different levels of education (primary, medium, high), corrected for body weight. In a model in which the three lines may be non-parallel, we first have to check for interaction (effect modification) 7; if the hypothesis that the coefficients of the interaction terms equal 0 is not rejected, it is reasonable to assume a model without interaction. Testing the hypothesis H0: β1 = β2 = 0, i.e. no differences between education levels when corrected for weight, is then based on the fitted model; note that the individual P-values for Z1 and Z2 depend on the choice of the reference group. The purposes of ANCOVA are to correct for confounding and to increase the precision of an estimated difference.
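A hedged R sketch of this kind of check, again on simulated data (the variable names and values are illustrative only):

```r
# ANCOVA with an interaction check -- a sketch on simulated data
set.seed(7)
weight <- rnorm(100, 75, 10)
gender <- factor(rep(c("men", "women"), each = 50))
height <- 110 + 0.9 * weight - 12 * (gender == "women") + rnorm(100, 0, 5)

with_int <- lm(height ~ gender * weight)  # allows non-parallel lines (interaction)
no_int   <- lm(height ~ gender + weight)  # assumes parallel lines

anova(no_int, with_int)  # tests beta3 = 0; a large p-value supports dropping the interaction
summary(no_int)          # the gender coefficient is the weight-corrected difference in HEIGHT
```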

In the general ANCOVA model:

Y = β0 + β1 Z1 + … + βk-1 Zk-1 + γ1 X1 + … + γp Xp + residual

where Y is the response variable, Z1, Z2, …, Zk-1 are dummy variables for the k groups, and X1, …, Xp are the confounders,

there is a straightforward extension to an arbitrary number of groups and covariables.

Coding categorical predictors in regression

One always has to figure out which way of coding categorical factors is used in order to be able to interpret the parameter estimates. In "reference cell" coding, one of the categories plays the role of the reference category ("reference cell"), while the other categories are indicated by dummy variables; the β's corresponding to the dummies are interpreted as the difference of the corresponding category from the reference category. In "difference with overall mean" coding, in the model of the previous example [Y = β0 + β1 Z1 + β2 Z2 + residual], β0 is interpreted as the overall mean of the three levels of education, while β1 and β2 are interpreted as the deviations of the means of primary and medium education from the overall mean, respectively; the deviation of the mean of the high level from the overall mean is given by (-β1 - β2). In "cell means" coding, in the previous model fitted without an intercept [Y = β1 Z1 + β2 Z2 + β3 Z3 + residual], β1 is the mean of the primary, β2 of the medium, and β3 of the high level of education 6,7,9.
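In R the three coding schemes correspond to different contrast settings for a factor; a brief sketch on simulated education data (values are made up, and the ordering of the deviation-coded levels may differ from the example above):

```r
# Coding schemes for a categorical predictor in R -- a sketch on simulated data
set.seed(8)
educ <- factor(sample(c("primary", "medium", "high"), 120, TRUE),
               levels = c("high", "primary", "medium"))  # "high" as the reference category
y <- rnorm(120, mean = c(high = 172, primary = 166, medium = 169)[as.character(educ)], sd = 6)

coef(lm(y ~ educ))                                        # reference cell coding (contr.treatment)
coef(lm(y ~ educ, contrasts = list(educ = "contr.sum")))  # difference-with-overall-mean coding
coef(lm(y ~ educ - 1))                                    # cell means coding: one mean per category
```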

Conclusions

It is apparent to anyone who reads the medical literature today that some knowledge of biostatistics and epidemiology is a necessity. The goal of any data analysis is to extract accurate estimates from the raw information. Before any testing or estimation, careful data editing is essential to check for errors, followed by data summarization. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). One option for answering this question is to employ regression analysis. There are various types of regression analysis, and all of these methods allow us to assess the impact of multiple variables on the response variable.

Education, income inequality, and mortality: a multiple regression analysis

  • Andreas Muller, professor (axmuller@ualr.edu)
  • Department of Health Services Administration, University of Arkansas at Little Rock, 207 Ross Hall, 2801 South University Ave, Little Rock, AR 72204, USA
  • Accepted 4 October 2001

Objective: To test whether the relation between income inequality and mortality found in US states is because of different levels of formal education.

Design: Cross sectional, multiple regression analysis.

Setting: All US states and the District of Columbia (n=51).

Data sources: US census statistics and vital statistics for the years 1989 and 1990.

Main outcome measure: Multiple regression analysis with age adjusted mortality from all causes as the dependent variable and 3 independent variables—the Gini coefficient, per capita income, and percentage of people aged ≥18 years without a high school diploma.

Results: The income inequality effect disappeared when percentage of people without a high school diploma was added to the regression models. The fit of the regression significantly improved when education was added to the model.

Conclusions: Lack of high school education accounts for the income inequality effect and is a powerful predictor of mortality variation among US states.

What is already known on this topic

Aggregate studies have shown a positive relation between income inequality and mortality, and three possible explanations have been suggested (relative deprivation, absolute deprivation, and aggregation artefact)

Income inequality may reflect the effects of other socioeconomic variables that are also related to mortality

What this study adds

Multiple regression analysis of the 50 US states and District of Columbia for 1989-90 indicates that the relation between income inequality and age adjusted mortality is due to differences in high school educational attainment: education absorbs the income inequality effect and is a more powerful predictor of variation in mortality among US states

Lack of high school education seems to affect mortality by economic resource deprivation, risk of occupational injury, and learnt risk behaviour. It may also measure the lifetime, cumulative effect of adverse socioeconomic conditions

Introduction

Several recent studies have reported a positive relation between income inequality and mortality. The association has been observed in US metropolitan areas and states and, to varying degrees, in international studies. 1 – 3 The relation remains intact when different measures of income inequality are used. The critical question is how this relation should be interpreted.

Three competing interpretations have been advanced. Wilkinson believes that income inequality produces psychosocial stresses for individuals placed at lower ranks of the socioeconomic hierarchy. 4 – 6 Continuous stress due to deprivation of status will lead to deteriorating health and higher mortality over time. The fact that median or per capita household income cannot account for the relation has been taken as evidence that “relative income,” or income inequality, is more important than absolute income for human health and longevity.

Gravelle argues that the correlation between income inequality and mortality may be artefactual in part. 7 He shows mathematically that the aggregate relation is consistent with a negative, curvilinear relation between income and the probability of dying for individuals. Wolfson et al's clever test of Gravelle's hypothesis indicates, however, that the individual relation between income and mortality cannot fully account for the aggregate relationship. 8

The “neo-material” interpretation asserts that income inequality reflects individual and community forms of absolute deprivation. Lynch et al argue that poorer individuals disproportionately experience health taxing events and lack of resources throughout their lives. 9 They live in deprived communities characterised by “underinvestment” in the social and physical infrastructure. Both forms of deprivation produce cumulative wear and tear. The experience depletes health, resulting in higher mortality for those in lower socioeconomic strata. The aggregate effect is that societies with increasing income inequality will experience higher mortality than they would otherwise. Lynch et al suggest that material conditions may be sufficient in explaining the relation between income inequality and mortality.

The neo-material interpretation gives only a broad indication of which material circumstances are important. Kaplan et al's analysis of US states, however, suggests some potential answers. 2 They report that income inequality is significantly correlated with certain risk factors (homicide rates and unemployment rates), social resources (food stamps and lack of health insurance), and measures of human capital (educational attainment). The substantial correlations with some measures of human capital imply that income inequality may not have a direct effect on mortality. Instead, income inequality may reflect the effects of other socioeconomic variables that are also related to mortality. Among those variables, the contribution of formal education deserves most attention since it typically precedes work and income. It is also related to mortality.

Higher educational degrees are typical prerequisites for highly compensated work in the United States and other industrialised nations. According to US census data for the year 1998, the median earnings of adult, year round workers with professional degrees are about four times higher than those of adults who had not completed high school. 10 Thus, the level of education ought to be correlated with cumulative income, which is the basis for measuring income inequality.

In addition, more schooling seems to extend life. 11 – 14 In econometric studies years of schooling typically had a stronger negative effect on age adjusted mortality than per capita income when other measures were controlled for. Therefore, the association between income inequality and mortality found in aggregate studies may be partially the result of variation in educational attainment. I tested this hypothesis using data for the US states, which have shown substantial associations between measures of income inequality and age adjusted mortality.

Data and methods

The study is based on a cross sectional analysis of US census statistics and vital statistics for the years 1989 and 1990 for all US states including the District of Columbia (n=51). Age adjusted mortality from all causes was the main dependent variable of the analysis. 15 I used the CDC WONDER data extraction tool to standardise the age specific death rates by the direct method, 15 using the US age distribution for 1990 as the standard population. The data were pooled for the years 1989 and 1990 to make death rates more reliable.

The Gini coefficient for households was the main independent variable of interest. 16 This measures the difference between the areas under the curve of a graph of actual distribution of cumulative income and one indicating equality of income distribution. The Gini coefficient ranges from 0 to 1 and measures the degree of income inequality. A value of 0 indicates that each household obtains the same amount of income, while a value of 1 indicates that only one household earns all income. 17 18
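As an aside, the Gini coefficient is straightforward to compute from a vector of household incomes using the mean absolute difference formula; a small sketch on made-up incomes:

```r
# Gini coefficient from a vector of incomes -- a sketch on made-up data
gini <- function(x) {
  # mean absolute difference over all pairs, divided by twice the mean income
  sum(abs(outer(x, x, "-"))) / (2 * length(x)^2 * mean(x))
}

incomes <- c(12000, 25000, 31000, 47000, 60000, 150000)  # hypothetical household incomes
gini(incomes)  # 0 = every household earns the same; values near 1 = extreme inequality
```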

To control for varying income levels among states, I included the per capita income of all people in the regression model. 19 The per capita income variable was log (ln) transformed to reduce positive skew. Both income variables pertain to the calendar year 1989. I measured educational attainment by the percentage of people aged ≥18 years without a high school diploma in 1990. 20

I analysed age adjusted mortality by multiple regression. 21 The proportion of the population living in each state in 1990 was the weighting factor, and STATISTICA software 22 estimated the regression models.
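A population-weighted regression of this form could be fitted in R roughly as sketched below; the data frame `states` and its variables are simulated stand-ins, not the author's actual files.

```r
# Weighted least-squares regression analogous to the analysis described above
# (simulated stand-in data; variable names are illustrative only)
set.seed(9)
states <- data.frame(
  mortality = rnorm(51, 8.5, 0.8),      # age adjusted deaths per 1000 population
  gini      = runif(51, 0.40, 0.50),    # household income inequality
  percap    = runif(51, 12000, 25000),  # per capita income
  no_hs     = runif(51, 15, 35),        # % of adults without a high school diploma
  pop_share = runif(51)
)
states$pop_share <- states$pop_share / sum(states$pop_share)  # weighting factor

fit <- lm(mortality ~ gini + log(percap) + no_hs, data = states, weights = pop_share)
summary(fit)
```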

Fig 1 shows the relation between the measure of income inequality and age adjusted mortality. The scatterplot indicates a positive linear relation, with the District of Columbia being an apparent outlier. The range in income inequality between states was about 0.1. The regression coefficient indicates that a 0.1 unit increase in the Gini coefficient was associated with an increase of 1.6 deaths per 1000 population.

Age adjusted death rates by Gini coefficient for the 50 US states and the District of Columbia (DC), 1989-90 (y=1.831+15.705×x; R 2 =0.24; weighted regression). (Data sources US Public Health Service 15 and US Census Bureau 16 )


Fig 2 shows a positive, linear relation between education and age adjusted mortality. The observations cluster around the regression line except for the District of Columbia. The range in the education variable was about 20 percentage points. The related increase in age adjusted mortality was about 2.1 deaths per 1000 population.

Age adjusted death rates by educational attainment for the 50 US states and the District of Columbia (DC), 1989-90 (y=6.16+0.103×x; R 2 =0.51; weighted regression). (Data sources US Public Health Service 15 and US Census Bureau 20 )

Fig 3 presents the percentage of variation in age adjusted mortality explained by five regression specifications. All regression models were statistically significant at P<0.001. The two income measures accounted for 27.7% of the variation in age adjusted mortality. Lack of high school education by itself explained over half of the variation in the dependent variable. The regression coefficients for both income variables were non-significant when added to a model including the education measure: they accounted for no additional variation in the dependent variable when the education variable was controlled. The adjusted R 2 values slightly decreased with the addition of the income measures, since the adjustment corrects for redundancy. Deleting the District of Columbia from the analysis improved the fit of regression specifications, including education, in the model but did not substantively change the results shown in fig 3 .

Percentage of variation in age adjusted mortality explained by education and income variables for the 50 US states and District of Columbia, 1989-90

Subgroup analysis

A preliminary analysis of age specific mortality indicated that the findings might best reflect the experience of people aged ≥45 years. For the 15-44 year age group, the Gini coefficient was significant and positively related to age specific death rates, whereas the education variable was only marginally significant. Since the analysis did not restrict the age range of the independent variables to people aged 15-44, the results might be biased. Deaths for 15-44 year olds comprised 8.3% of all US deaths in 1989-90, with accidental and violent deaths among the leading causes.

The definition of the education variable excludes children. Therefore, I estimated all regressions with the dependent variable restricted to people aged ≥20 years. The results of the analysis paralleled those in fig 3 , with model fit reduced by 1 to 3 percentage points.

Gini coefficients for individual states were not available by householder's race or sex. As an alternative, I included the percentage of African-American and Latino people in populations in a regression model that included education, per capita income, and the Gini coefficient. The variable measuring the effect of belonging to economically depressed minorities was significant (b=0.03; t=3.26) and reduced the direct education effect to b=0.07 (t=2.97).

I also ran the regressions for each sex. The education and income variables predicted age adjusted mortality for males better (R 2 adj=0.54) than for females (R 2 adj=0.34). However, the results of the sex specific analyses were consistent with those in fig 3 .

This study had two main findings. Income inequality, as measured by the Gini coefficient, had no unique effect on US age adjusted mortality when the level of formal education was controlled for. Educational attainment, as measured by lack of completed high school education, was a more powerful predictor of differences in mortality than income inequality in US states.

Over a decade has passed since the 1990 US census was taken. Therefore, my findings may not be applicable today. When data on income inequality and vital statistics are released for individual states for the years 1999-2000 this concern can be examined.

The potential role of education has been overlooked in previous research on income inequality and mortality, 1 2 which focused more on the potentially contaminating effects of income and poverty. In my analysis I did not directly control for poverty, but the effect of poverty was not excluded. It was indirectly reflected in the per capita income and education measures.

Implications of results

Lack of high school education completely captured the income inequality effect and income level effect in my age adjusted analysis. This finding suggests that physical and social conditions associated with low levels of education may be sufficient for interpreting the relation between income inequality and mortality. My results therefore seem to support the idea that absolute deprivation rather than relative deprivation is important for influencing mortality.

One reviewer pointed out that this view might be too narrow. The income inequality measure might also express the “burden of relative deprivation” in society, as discussed by Marmot and Wilkinson. 23 Lack of high school education may indicate low status, which, by definition, implies a relative position in the social hierarchy. However, low educational status may indicate only lack of material resources and other adverse life circumstances. It remains to be seen whether low educational status produces the additional stressful, invidious hierarchical comparisons that lead to poorer health and greater mortality. Since aggregate data are not well suited for examining hypotheses at the individual level, my study cannot confirm or rule out the importance of psychosocial processes.

An expanded regression analysis (available on request) indicated that lack of high school education was related to lack of health insurance, belonging to economically depressed minority groups, working in jobs with high risk of injury, and smoking. This finding suggests that lack of material resources, occupational exposure to risk, and certain learnt health risk behaviour might be reflected in the large education-mortality effect.

Less educated people may be concentrated in areas that are more risky to life and health. Some research has suggested that these communities may lack sufficient investment in health related infrastructure such as access to health care, proper police protection, and healthy housing. 24 These potential risk factors are only indirectly assessed by the variables used in my study.

Lack of high school education may also represent lifetime effects of socioeconomic deprivation. Davey Smith et al found that socioeconomic conditions during childhood adversely affected adult mortality in a large, prospective study of adult Scottish men. 25 My study could not determine intergenerational effects of educational attainment. However, this path of research seems promising since considerable linkage between parents and offspring has been seen for educational attainment and for incomes in Britain 26 and in the United States. 27 28 Lack of high school education may also capture the lifetime effect of adverse social conditions increasing mortality. Income inequality is only one aspect of this broader experience.

Acknowledgments

I thank Drs Wilkinson, Davey Smith, and Altman for their valuable comments on an earlier version of this paper.

Funding My study was supported by my sabbatical leave granted by the University of Arkansas at Little Rock.

Competing interests None declared.


Introduction to Research Methods

15 Multiple Regression

In the last chapter we met our new friend (frenemy?) regression, and did a few brief examples. And at this point, regression is actually more of a roommate. If you stay in the apartment (research methods) it’s gonna be there. The good thing is regression brings a bunch of cool stuff for the apartment that we need, like a microwave.

15.1 Concepts

Let’s begin this chapter with a bit of a mystery, and then use regression to figure out what’s going on.

What would you predict, just based on what you know and your experiences, the relationship between the number of computers at a school and their math test scores is? Do you think schools with more computers do worse or better?

Computers might be useful for teaching math, and are typically more available in wealthier schools. Thus, I would predict that the number of computers at a school would predict higher scores on math tests. We can use the data on California schools to test that idea.
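The regression output itself isn't reproduced here, but one plausible way to run this bivariate model, assuming the California schools data is the `CASchools` dataset from the `AER` package (which matches the chapter's description), is:

```r
# Bivariate regression of math scores on the number of computers -- a sketch,
# assuming the CASchools data from the AER package matches the chapter's data
library(AER)
data("CASchools")

summary(lm(math ~ computer, data = CASchools))
```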

Oh. Interesting. The relationship is insignificant and, perhaps most surprisingly, negative. Schools with more computers did worse on the test in the sample. For each additional computer at a school, scores on the math test decreased by .001 points, and that result is not significant.

So computers don’t make much of a difference. Are computers distracting the test takers? Diminishing their skills in math? My old math teachers were always worried about us using calculators too much. Maybe, but maybe it’s not the computers fault.

Let’s ask a different question then.

What do you think the relationship is between the number of computers at a school and the number of students? Larger schools might not have the same number of computers per student, but if you had to bet money would you think the school with 10,000 students or 1000 students would have more computers?

If you’re guessing that schools with more students have more computers, you’d be correct. The correlation coefficient for the number of students and computers is .93 (very strong), and we can see that below in the graph.


More students means more computers. In the regression we ran, though, all the model knows is that schools with more computers do worse on math; it can't tell why. If larger schools have more computers AND do worse on tests, a bivariate regression can't separate those effects on its own. We did bivariate regression in the last chapter, where we just look at two variables, one independent and one dependent (bivariate means two (bi) variables (variate)).

Multiple regression can help us try though. Multiple regression doesn’t mean running multiple regressions, it refers to including multiple variables in the same regression. Most of the tools we’ve learned so far only allow for two variables to be used, but with regression we can use many (many) more.

Let’s see what happens when we look at the relationship between the number of computers and math scores, controlling for the number of students at the school.

This second regression shows something different. In the earlier regression, the number of computers was negative and not significant. Now? Now it’s positive and significant. So what happened?

We controlled for the number of students that are at the school, at the same time that we’re testing the relationship between computers and math scores. Don’t worry if that’s not clear yet, we’re going to spend some time on it. When I say “holding the number of students constant” it means comparing schools with different numbers of computers but that have the same number of students. If we compare two schools with the same number of students, we can then better identify the impact of computers.

We can interpret the variables in the same way as earlier when just testing one variable to some degree. We can see that a larger number of computers is associated with higher test scores, and that larger schools generally do worse on the math test.

Specifically, a one unit increase in computers is associated with an increase of math scores of .02 points, and that change is highly significant.

But our interpretation needs to add something more. With multiple regression what we’re doing is looking at the effect of each variable, while holding the other variable constant.

Specifically, a one unit increase in computers is associated with an increase of math scores of .002 points when holding the number of students constant, and that change is highly significant.

When we look at the effect of computers in this regression, we’re setting aside the impact of student enrollment and just looking at computers. And when we look at the coefficient for students, we’re setting aside the impact of computers and isolating the effect of larger school enrollments on test scores.

We looked at scatter plots and added a line to the graph to better understand the direction of relationships in the previous chapter. We can do that again, but it’s slightly different.

Here is the relationship of computers to math scores, and the relationship of computers to math scores holding students constant. That means we’re actually producing separate lines for both variables, but we’re doing that after accounting for the impact of computers on school enrollment, and school enrollment on computers.


We can also graph it in 3 dimensions, where we place the outcome on the z axis coming out of the paper/screen towards you.


But I’ll be honest, that doesn’t really clarify it for me. Multiple regression is still about drawing lines, but it’s more of a theoretical line. It’s really hard to actually effectively draw lines as we move beyond two variables or two dimensions. Hopefully that logic of drawing a line and the equation of a line still makes sense for you, because it’s the same formula we use in interpreting multiple regressions.

What we’re figuring out with multiple regression is what part of math scores is determined uniquely by the student enrollment at a school and what part of math scores is determined uniquely by the number of computers. Once R figures that out it gives us the slope of two lines, one for computers and one for students. The line for computers slopes upwards, because the more computers a school has the better it’s students do, when we hold constant the number of students at the school. When we hold constant the number of computers, larger schools do worse on the math test.

I don’t expect that to fully make sense yet. Understanding what it means to “hold something constant” is pretty complex and theoretical, but it’s also important to fully utilizing the powers of regression. What this example illustrates though is the dangers inherent in using regression results, and the difficulty of using them to prove causality.

Let’s go back to the bivariate regression we did, just including the number of computers at a school and math test scores. Did that prove that computers don’t impact scores? No, even though that would be the correct interpretation of the results. But lets go back to what we need for causality…

  • Co-variation
  • Temporal Precedence
  • Elimination of Extraneous Variables or Hypotheses

We failed to eliminate extraneous variables. We tested the impact of computers, but we didn't do anything to test any other hypotheses of what impacts math scores. We didn't test whether other factors that impact scores (number of teachers, wealth of parents, size of the school) had a mediating relationship on the number of computers. Until we test every other explanation for the relationship, we haven't really proven anything about computers and test scores. That's why we need to take caution in doing regression. Yes, you can now do regression, and you can hopefully interpret the results correctly. But correctly interpreting a regression, and doing a regression that proves something, is a little more complicated. We'll keep working towards that though.

15.1.1 Predicting Wages

To this point the book has attempted to avoid touching on anything that is too controversial. Statistics is math, so it's a fairly apolitical field, but it can be used to support political or controversial matters. We're going to wade into one in this chapter, to try and show the way that statistics can let us get at some of the thorny issues our world deals with. In addition, this example should help to clarify what it means to "hold something constant".

We’ll work with the same income data we used in the last chapter from the Panel Study of Income Dynamics from 1982. Just to remind you, these are the variables we have available.

  • experience - Years of full-time work experience.
  • weeks - Weeks worked.
  • occupation - factor. Is the individual a white-collar (“white”) or blue-collar (“blue”) worker?
  • industry - factor. Does the individual work in a manufacturing industry?
  • south - factor. Does the individual reside in the South?
  • smsa - factor. Does the individual reside in a SMSA (standard metropolitan statistical area)?
  • married - factor. Is the individual married?
  • gender - factor indicating gender.
  • union - factor. Is the individual’s wage set by a union contract?
  • education - Years of education.
  • ethnicity - factor indicating ethnicity. Is the individual African American (“afam”) or not (“other”)?
  • wage - Wage.

Let’s say we wanted to understand wage discrimination on the basis of race or ethnicity Do African Americans earn less than others in the workplace? Let’s see what this data tells us.

And a note before we begin. The variable ethnicity has two categories, "afam", which indicates African American, and "other", which means anything but African American. Obviously, that "other" category captures a lot today, but in the early 1980s it can generally be understood to mean white people. I'll generally just refer to it as other races in the text though.

ethnicity wage
other 1174
afam 808.5

The average wage for African Americans in the data is 808.5, and for others the average wage is 1174. That means that African Americans earn (in this really specific data set) about 69% of what other workers earn, or 365.5 less.

Let’s say we take that fact to someone that doesn’t believe that African Americans are discriminated against. We’ll call them you’re “contrarian friend”, you can fill in other ideas of what you’d think about that person. What will their response be? Probably that it isn’t evidence of discrimination, because of course African Americans earn less, they’re less likely to work in white collar jobs. And people that work in white collar jobs earn more, so that’s the reason African Americans earn less. It’s not discrimination, it’s just that they work different jobs.

And on the surface, they’d be right. African Americans are more likely to work in blue collar jobs (65% to 50%), and blue collar jobs earn less (956 for blue collar jobs to 1350 for white collar jobs).

ethnicity blue_collar
other 0.5018
afam 0.6512
blue_collar wage
0 1350
1 956.4

So what we’d want to do then is compare African Americans to others that both work blue collar jobs, and African Americans to others working white collar jobs. If there is a difference in wages between two people working the same job, that’s better evidence that the pay gap is a result not of their occupational choices but their race.

We can visualize that with a two by two chart.
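The two-by-two chart isn't shown here, but assuming the wage data is the `PSID1982` dataset from the `AER` package (which matches the variable list above), the table of mean wages could be produced like this:

```r
# Mean wage by ethnicity and occupation (a 2 by 2 cross tab) -- a sketch,
# assuming the PSID1982 data from the AER package matches the chapter's data
library(AER)
data("PSID1982")

tapply(PSID1982$wage, list(PSID1982$ethnicity, PSID1982$occupation), mean)
```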

Let’s work across that chart to see what it tells us. A 2 by 2 chart like that is called a cross tab because it let’s us tab ulate figures a cross different characteristics of our data. They can be a methodologically simple way (we’re just showing means/averages there) to tell a story if the data is clear.

So what do we learn? Looking at the top row, white collar workers that are labeled other for ethnicity earn on average $1373. And white collar workers that are African American earn $918. Which means that for white collar workers, African Americans earn $455 less. For blue collar workers, other races earn $977, while African Americans earn $749. That's a gap of $228. So the size of the gap is different depending on what a person's job is, but African Americans earn less regardless of their job. So it isn't just that African Americans are less likely to work white collar jobs that drives their lower wages. Even those in white collar jobs earn less. In fact, African Americans in white collar jobs earn less on average than other races working blue collar jobs!

This is what it means to hold something constant. In that table above we’re holding occupation constant, and comparing people based on their race to people of another race that work the same job. So differences in those jobs aren’t influencing our results now, we’ve set that effect aside for the moment.

And we can do that automatically with regression, like we did when we looked at the effect of computers on math scores, while holding the impact of school enrollment constant.
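Again assuming the `PSID1982` data from `AER`, that regression might be run as:

```r
# Wage regression controlling for occupation -- a sketch,
# again assuming the PSID1982 data from the AER package
library(AER)
data("PSID1982")

summary(lm(wage ~ ethnicity + occupation, data = PSID1982))
```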

Based on those regression results, African Americans earn $309 less than other races when holding occupation constant, and that effect is highly significant. And blue collar workers earn $380 less than white collar workers when holding race constant, and that effect is significant too.

So have we proven discrimination in wages? Probably not yet for the contrarian friend. Without pause they’ll likely say that education is also important for wages, and African Americans are less likely to go to college. And in the data they’d be correct. On average African Americans completed 11.65 years of education, and other races completed 12.94.

ethnicity education
other 12.94
afam 11.65

So let’s add that to our regression too.

Now with the ethnicity variable we’re comparing people of different ethnicities that have the same occupation and education. And what do we find? Even holding both of those constant, we would expect an African American worker to earn $262 less, and that is highly significant.

What your contrarian friend is doing is proposing alternative variables and hypotheses that explain the gap in earnings for African Americans. And while those other things do make a difference, they don't fully explain why African Americans earn less than others. We have shrunk the gap somewhat. Originally the gap was 365.5, which fell to 309 when we held occupation constant, and to 262 with the inclusion of education. So those alternative explanations do explain a portion of why African Americans earned less: it was because they had lower-status jobs and less education (setting aside the fact that their lower-status jobs and less education may be the result of discrimination).

So what else do we want to include to try and explain that difference in wages? We can insert all of the variables in the data set to see if there is still a gap in wages between African Americans and others.
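A sketch of that full model, again assuming the `PSID1982` data from `AER`:

```r
# Full wage model with all available controls -- a sketch,
# again assuming the PSID1982 data from the AER package
library(AER)
data("PSID1982")

summary(lm(wage ~ ethnicity + occupation + education + experience + weeks +
             industry + south + smsa + married + gender + union,
           data = PSID1982))
```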

Controlling for occupation, education, experience, weeks worked, the industry, the region of employment, whether they are married, their gender, and their union status, does ethnicity make a difference in earnings? Yes, if you found two workers that had the same values for all of those variables except that they were of different races, the African American would still likely earn less.

In our regression African Americans earn $167 less when holding occupation, education, experience, weeks worked, the industry, region, marriage, gender, and their union status constant, and that effect is still statistically significant.

The contrarian friend may still have another alternative hypothesis to attempt to explain away that result, but unfortunately that’s all the data will let us test.

What we’re attempting to do is minimize what is called the missing variable bias . If there is a plausible story that explains our result, whether one is predicting math test scores or wages or whatever else, if we fail to account for that explanation our model may be misleading. It was misleading to say that computers don’t increase math test scores when we didn’t control for the effect of larger school sizes.

What missing variables do we not have that may explain the difference in earnings between African Americans and others? We don't know who is a manager at work or anything about job performance, and both of those should help explain why people earn more. So we haven't removed our missing variable bias; the evidence we can provide is limited by that. But based on the evidence we can generate, we find evidence of racial discrimination in wages.

And I should again emphasize, even if something else did explain the gap in earnings between African Americans and others it wouldn’t prove there wasn’t discrimination in society. If differences in occupation did explain the racial gap in wages, that wouldn’t prove the discrimination didn’t push African Americans towards lower paying jobs.

But the work we’ve done above is similar to what a law firm would do if bringing a lawsuit against a large employer for wage discrimination. It’s hard to prove discrimination in individual cases. The employer will always just argue that John is a bad employee, and that’s why they earn less than their coworkers. Wage discrimination suits are typically brought as class action suits, where a large group of employees sues based on evidence that even when accounting for differences in specific job, and job performance, and experience, and other things there is still a gap in wages.

I should add a note about interpretation here. It's the researcher that has to identify what the different coefficients mean in the real world. We can talk about discrimination because of differences in earnings for African Americans and others, but we wouldn't say that blue collar workers are discriminated against because they earn less than white collar workers. It's unlikely that someone would say that people with more experience earning more is the result of discrimination. These are interpretations that we layer on to the analysis based on our expectations and understanding of the research question.

15.1.2 Predicting Affairs

Regression can be used to make predictions and learn more about the world in all sorts of contexts. Let’s work through another example, with a little more focus on the interpretation.

We’ll use a data set called Affairs, which unsurprisingly has data about affairs. Or more specifically, about people, and whether or not they have had an affair.

In the data set there are 10 variables.

  • affairsany - coded as 0 for those who haven’t had an affair and 1 for those who have had any number of affairs. This will be the dependent variable.
  • gender - either male or female
  • age - respondents age
  • yearsmarried - number of years of current marriage
  • children - are there children from the marriage
  • religiousness - scaled from 1-5, with 1 being anti religion and 5 being very religious
  • education - years of education
  • occupation - 1-7 based on a specific system of rating for occupations
  • rating - 1-5 based on how happy the respondent reported their marriage being.

So we can throw all of those variables into a regression and see which ones have the largest impact on the likelihood someone had an affair. But before that we should pause to make predictions. We shouldn't include a variable just for laughs - we should have a reason for including it. We should be able to make a prediction for whether it should increase or decrease the dependent variable.

So what effect do you think these independent variables will have on the chances of someone having had an affair?

  • gender - I would guess that men's (on average) higher libidos and lower levels of concern about childbearing will lead to more affairs.
  • age - Young people are typically a little less ready for long term commitments, and a bit more irrational and willing to take chances, so age should decrease affairs. Although being older does give you more time to have had an affair.
  • yearsmarried - Longer marriages should be less likely to contain an affair. If someone was going to have an affair, I would expect it to happen earlier, and such things often end marriages.
  • children - Children, and avoiding hurting them, are hopefully a good reason for people to avoid having affairs.
  • religiousness - most religions teach that affairs are wrong, so I would guess people that are more religious are less likely to have affairs
  • education and occupation - I actually can’t make a prediction for what effect education or occupation have on affairs, and since I don’t think they’ll impact the dependent variable I wouldn’t include them in the analysis if I was doing this for myself. But I’ll keep them here as an example to talk about later.
  • rating - happier marriages will likely produce fewer affairs, in large part because it’s often unhappiness that makes couples stray.

Those arguments may be wrong or right. And they certainly wont be right in every case in the data - there will be counter examples. What I’ve tried to do is lay out predictions, or hypotheses, for what I expect the model to show us. Let’s test them all and see what predicts whether someone had an affair.
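One plausible way to fit this model, assuming the data is the `Affairs` dataset from the `AER` package (the 0/1 outcome `affairsany` is created below, since the raw data records a count of affairs):

```r
# Linear probability model for having had an affair -- a sketch,
# assuming the Affairs data from the AER package matches the chapter's data
library(AER)
data("Affairs")

Affairs$affairsany <- as.numeric(Affairs$affairs > 0)  # 1 = any affair, 0 = none

summary(lm(affairsany ~ gender + age + yearsmarried + children +
             religiousness + education + occupation + rating,
           data = Affairs))
```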

What do you see as the strongest predictors of whether someone had an affair? Let’s start by identifying what was highly statistically significant. Religiousness and rating both had p-values below .001, so we can be very confident that in the population people who are more religious and who report having happier marriages are both less likely to have affairs. Let’s interpret that more formally.

For each one unit increase in religiousness an individual’s chances of having an affair decrease by .05 holding their gender, age, years married, children, education, occupation and rating constant, and that change is significant.

That’s a long list of things we’re holding constant! When you get past 2 or 3 control variables, or when you’re describing different variables from the same model you can use “holding all else constant” in place of the list.

For each one unit increase in the happiness rating of a marriage an individual’s chances of having an affair decrease by .09, holding all else constant , and that change is significant.

What else that we included in the model is useful for predicting whether someone had an affair?

Age and years married both reach statistical significance. As individuals get older, their chances of having an affair decrease, as I predicted.

However, as their marriages get longer the chances of having had an affair increase, not decrease as I thought. Interesting! Does that mean I should go back and change my prediction? No. What it likely means is that some of my assumptions were wrong, so I should update them and discuss why I was wrong (in the conclusion if this was a paper). If we only used regression to find things that we already know, we wouldn't learn anything new. It's still good that I made a prediction though, because that highlights that the result is a little weird (to my eyes) or may be more surprising to the readers. Imagine if you found that a new jobs program actually lowered participants' incomes; that would be a really important outcome of your research and just as valuable as if you'd found that incomes increase.

A surprising finding could also be evidence that there's something wrong in the data. Did we enter years of marriage correctly, or did we possibly reverse it so that longer marriages are actually coded as lower numbers? That'd be odd in this case, but it's always worth thinking that possibility through. If I got data that showed college graduates earned less than those without a high school degree I'd be very skeptical of the data, because that would go against everything we know. It might just be an odd, fluky one-time finding, or it could be evidence something is wrong in the data.

Okay, what about everything else? All the other variables are insignificant. Should we remove them from the analysis, since they don’t have a significant effect on our dependent variable? It depends. Insignificant variables can be worth including in most cases in order to show that they don’t have an effect on the outcome. It’s worth knowing that gender and children don’t have an effect on affairs in the population. We had a reason to think they would, and it turns out they don’t really have much of an influence on whether someone has sex outside their marriage. That’s good to know.

I didn’t have a prediction for education or occupation though, and the fact they are insignificant means they aren’t really worth including. I’m not testing any interesting ideas about what affects affairs with those variables, they’re just being included because they’re in the data. That’s not a good reason for them to be there, we want to be testing something with each variable we include.

15.2 Practice

In truth, we haven’t done a lot of new work on code in this chapter. We’ve more so focused on this big idea of what it means to go from bivariate regression to multivariate regression. So we wont do a lot of practice, because the basic structure we learned in the last chapter drives most of what we’ll do.

We’ll read in some new data, that’s on Massachusetts schools and test scores there. It’s similar to the California Schools data, but from Massachusetts for variety.

We’ll focus on 4 of those variable, and try to figure out what predicts how schools do on tests in 8th grade (score8).

  • score8 - test scores for 8th graders
  • exptot - total spending for the school
  • english - percentage of students that don’t speak english as their native language
  • income - income of parents

Let’s start by practicing writing a regression to look at the impact of spending (exptot) on test scores.

That should look very similar to the last chapter. And we can interpret it the same way.

For each one unit increase in spending, we observe a .004 increase in test scores for 8th graders, and that change is significant.

Let’s add one more variable to the regression, and now include english along with exptot. To include an additional variable we just place a + sign between the two variables, as shown below.

Each one unit increase in spending is associated with a .007 increase in test scores for 8th graders, holding the percentage of english speakers constant, and that change is significant.

Each one unit increase in the percentage of students that don’t speak english as natives is associated with a 4.1 decrease in test scores for 8th graders, holding the spending constant, and that change is significant.

And one more, let’s add one more variable: income.

Interesting: spending actually lost its significance in that final regression and changed direction.

Each one unit increase in spending is associated with a .002 decrease in test scores for 8th graders when holding the percentage of english speakers and parental income constant, but that change is insignificant.

Each one unit increase in the percentage of students that don’t speak english as natives is associated with a 2.2 decrease in test scores for 8th graders when holding spending and parental income constant, and that change is significant.

Each one unit increase in parental income is associated with a 2.8 increase in test scores for 8th graders when holding spending and the percentage of english speakers constant, and that change is significant.

The following video demonstrates the coding steps done above.


Section 5.3: Multiple Regression Explanation, Assumptions, Interpretation, and Write Up

Learning Objectives

At the end of this section you should be able to answer the following questions:

  • Explain the difference between Multiple Regression and Simple Regression.
  • Explain the assumptions underlying Multiple Regression.

Multiple Regression is a step beyond simple regression. The main difference between simple and multiple regression is that multiple regression includes two or more independent variables – sometimes called predictor variables – in the model, rather than just one.

As such, the purpose of multiple regression is to determine the utility of a set of predictor variables for predicting an outcome, which is generally some important event or behaviour. This outcome can be designated as the outcome variable, the dependent variable, or the criterion variable. For example, you might hypothesise that the need to belong will predict motivations for Facebook use and that self-esteem and meaningful existence will uniquely predict motivations for Facebook use.

Before beginning your analysis, you should consider the following points:

  • Regression analyses reveal relationships among variables (relationship between the criterion variable and the linear combination of a set of predictor variables) but do not imply a causal relationship.
  • A regression solution – or set of predictor variables – is sensitive to combinations of variables. Whether a predictor is important in a solution depends on the other predictors in the set. If the predictor of interest is the only one that assesses some important facet of the outcome, it will appear important. If a predictor is only one of several predictors that assess the same important facet of the outcome, it will appear less important. For a good set of predictor variables, the smallest set of uncorrelated variables is best.

PowerPoint: Venn Diagrams

Please click on the link labeled “Venn Diagrams” to work through an example.

  • Chapter Five – Venn Diagrams

In these Venn Diagrams, you can see why it is best for the predictors to be strongly correlated with the dependent variable but uncorrelated with the other Independent Variables. This reduces the amount of shared variance between the independent variables.  The illustration in Slide 2 shows logical relationships between predictors, for two different possible regression models in separate Venn diagrams. On the left, you can see three partially correlated independent variables on a single dependent variable. The three partially correlated independent variables are physical health, mental health, and spiritual health and the dependent variable is life satisfaction. On the right, you have three highly correlated independent variables (e.g., BMI, blood pressure, heart rate) on the dependent variable of life satisfaction. The model on the left would have some use in discovering the associations between those variables, however, the model on the right would not be useful, as all three of the independent variables are basically measuring the same thing and are mostly accounting for the same variability in the dependent variable.

There are two main types of regression with multiple independent variables:

  • Standard or Single Step: Where all predictors enter the regression together.
  • Sequential or Hierarchical:  Where all predictors are entered in blocks. Each block represents one step.

We will now be exploring the single step multiple regression:

All predictors enter the regression equation at once. Each predictor is treated as if it had been analysed in the regression model after all other predictors had been analysed. These predictors are evaluated by the variance (i.e., level of prediction) shared between the dependent variable and the individual predictor variable.

Multiple Regression Assumptions

There are a number of assumptions that should be assessed before performing a multiple regression analysis:

  • The dependent variable (the variable of interest) needs to be measured on a continuous scale.
  • There are two or more independent variables, which can be measured on either a continuous or a categorical scale.
  • The three or more variables of interest should have a linear relationship, which you can check by using a scatterplot.
  • The data should have homoscedasticity. In other words, the spread of the residuals should remain similar as the data points move along the line of best fit in a positive or negative direction. Homoscedasticity can be checked by plotting the standardised residuals against the unstandardised predicted values.
  • The data should not have two or more independent variables that are highly correlated. This is called multicollinearity, which can be checked using variance inflation factor (VIF) values. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
  • There should be no spurious outliers.
  • The residuals (errors) should be approximately normally distributed. This can be checked with a histogram (with a superimposed normal curve) and by plotting the standardised residuals using either a P-P plot or a Normal Q-Q plot.
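
Although this chapter works through the output of a point-and-click package, the same checks can be sketched in R. The data frame and variable names below (dat, illness, stress, age, gender) are hypothetical placeholders used only to illustrate the checks listed above:

```r
# Hypothetical data: outcome 'illness' and predictors 'stress', 'age', 'gender' in data frame 'dat'
library(car)   # provides vif() for the multicollinearity check

fit <- lm(illness ~ stress + age + gender, data = dat)

# Homoscedasticity: standardised residuals against the predicted values
plot(fitted(fit), rstandard(fit),
     xlab = "Predicted values", ylab = "Standardised residuals")
abline(h = 0, lty = 2)

# Multicollinearity: variance inflation factors (large values flag collinear predictors)
vif(fit)

# Outliers: flag cases with standardised residuals beyond roughly +/- 3
which(abs(rstandard(fit)) > 3)

# Normality of residuals: histogram and normal Q-Q plot of the standardised residuals
hist(rstandard(fit))
qqnorm(rstandard(fit)); qqline(rstandard(fit))
```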

Multiple Regression Interpretation

For our example research question, we will be looking at the combined effect of three predictor variables – perceived life stress, gender, and age – on the outcome variable of physical health.

PowerPoint: Standard Regression

Please open the output at the link labeled “Chapter Five – Standard Regression” to view the output.

  • Chapter Five – Standard Regression

Slide 1 contains the standard regression analysis output.


On Slide 2, you can see in the red circle that the test statistics are significant. The F-statistic examines the overall significance of the model and shows whether your predictors, as a group, provide a better fit to the data than no predictor variables, which they do in this example.

The R² values are shown in the green circle. The R² value shows the total amount of variance in the criterion accounted for by the predictors, and the adjusted R² is the estimated value of R² in the population.

Table with data on physical illness

Moving on to the individual variable effects on Slide 3, you can see the significance of each individual predictor's contribution in light blue. The unstandardised slope, or B value, is shown in red; it represents the predicted change in the outcome for a one-unit increase in the predictor (e.g., a 1-unit increase in perceived stress is associated with a .40 increase in physical illness). Finally, you can see the standardised slope values in green, which are also known as beta values. These values are standardised to range between −1 and +1, similar to an r value.

We should also briefly discuss dummy variables:

Table with data on physical illness

A dummy variable is a variable used to represent categorical information about the participants in a study. This could include gender, location, race, age group, and so on. Dummy variables are most often dichotomous (they have only two values). When performing a regression, interpretation is easier if the values of the dummy variable are set to 0 or 1, with 1 usually representing the presence of a characteristic. For example, consider a question asking participants “Do you have a driver's licence?” with a forced-choice response of yes or no.

In this example, on Slide 3 and circled in red, the variable is gender, with male = 0 and female = 1. A positive slope (B) means an association with the category coded 1, whereas a negative slope means an association with the category coded 0. In this case, being female was associated with greater levels of physical illness.

Multiple Regression Write Up

Here is an example of how to write up the results of a standard multiple regression analysis:

In order to test the research question, a multiple regression was conducted, with age, gender (0 = male, 1 = female), and perceived life stress as the predictors and level of physical illness as the dependent variable. Overall, the results showed that the predictive model was significant, F(3, 363) = 39.61, R² = .25, p < .001, with the predictors together explaining 25% of the variance in physical illness. Perceived stress and gender were significant positive predictors of physical illness (β = .47, t = 9.96, p < .001, and β = .15, t = 3.23, p = .001, respectively). Age (β = −.02, t = −0.49, p = .63) was not a significant predictor of physical illness.
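
For readers who prefer syntax, a minimal sketch of a model with this structure in R follows. The data frame and variable names (survey, illness, stress, age, gender) are hypothetical, and the statistics reported above come from the original output, not from this code:

```r
# Hypothetical data frame 'survey' containing physical illness scores and the three predictors
survey$gender <- ifelse(survey$gender == "female", 1, 0)   # dummy code: male = 0, female = 1

fit <- lm(illness ~ age + gender + stress, data = survey)
summary(fit)   # overall F-test, R-squared, unstandardised B values, t and p values

# Standardised (beta) slopes: refit with all variables standardised
fit_std <- lm(scale(illness) ~ scale(age) + scale(gender) + scale(stress), data = survey)
coef(fit_std)
```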

Statistics for Research Students Copyright © 2022 by University of Southern Queensland is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.



Research Article

Anxiety, Affect, Self-Esteem, and Stress: Mediation and Moderation Effects on Depression

Ali Al Nima, Patricia Rosenberg, Trevor Archer, Danilo Garcia

Affiliations: Department of Psychology, University of Gothenburg, Gothenburg, Sweden; Network for Empowerment and Well-Being, University of Gothenburg, Gothenburg, Sweden; Department of Psychology, Education and Sport Science, Linneaus University, Kalmar, Sweden; Center for Ethics, Law, and Mental Health (CELAM), University of Gothenburg, Gothenburg, Sweden; Institute of Neuroscience and Physiology, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden

* E-mail: [email protected]

  • Published: September 9, 2013
  • https://doi.org/10.1371/journal.pone.0073265

23 Sep 2013: Nima AA, Rosenberg P, Archer T, Garcia D (2013) Correction: Anxiety, Affect, Self-Esteem, and Stress: Mediation and Moderation Effects on Depression. PLOS ONE 8(9): 10.1371/annotation/49e2c5c8-e8a8-4011-80fc-02c6724b2acc. https://doi.org/10.1371/annotation/49e2c5c8-e8a8-4011-80fc-02c6724b2acc View correction


Mediation analysis investigates whether a variable (i.e., a mediator) changes in regard to an independent variable and, in turn, affects a dependent variable. Moderation analysis, on the other hand, investigates whether the statistical interaction between independent variables predicts a dependent variable. Although the difference between these two types of analysis is explicit in the current literature, there is still confusion with regard to the mediating and moderating effects of different variables on depression. The purpose of this study was to assess the mediating and moderating effects of anxiety, stress, positive affect, and negative affect on depression.

Two hundred and two university students (males  = 93, females  = 113) completed questionnaires assessing anxiety, stress, self-esteem, positive and negative affect, and depression. Mediation and moderation analyses were conducted using techniques based on standard multiple regression and hierarchical regression analyses.

Main Findings

The results indicated that (i) anxiety partially mediated the effects of both stress and self-esteem upon depression, (ii) that stress partially mediated the effects of anxiety and positive affect upon depression, (iii) that stress completely mediated the effects of self-esteem on depression, and (iv) that there was a significant interaction between stress and negative affect, and between positive affect and negative affect upon depression.

The study highlights different research questions that can be investigated depending on whether researchers decide to use the same variables as mediators and/or moderators.

Citation: Nima AA, Rosenberg P, Archer T, Garcia D (2013) Anxiety, Affect, Self-Esteem, and Stress: Mediation and Moderation Effects on Depression. PLoS ONE 8(9): e73265. https://doi.org/10.1371/journal.pone.0073265

Editor: Ben J. Harrison, The University of Melbourne, Australia

Received: February 21, 2013; Accepted: July 22, 2013; Published: September 9, 2013

Copyright: © 2013 Nima et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The authors have no support or funding to report.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Mediation refers to the covariance relationships among three variables: an independent variable (1), an assumed mediating variable (2), and a dependent variable (3). Mediation analysis investigates whether the mediating variable accounts for a significant amount of the shared variance between the independent and the dependent variables–the mediator changes in regard to the independent variable, in turn, affecting the dependent one [1] , [2] . On the other hand, moderation refers to the examination of the statistical interaction between independent variables in predicting a dependent variable [1] , [3] . In contrast to the mediator, the moderator is not expected to be correlated with both the independent and the dependent variable–Baron and Kenny [1] actually recommend that it is best if the moderator is not correlated with the independent variable and if the moderator is relatively stable, like a demographic variable (e.g., gender, socio-economic status) or a personality trait (e.g., affectivity).

Although both types of analysis lead to different conclusions [3] and the distinction between statistical procedures is part of the current literature [2], there is still confusion about the use of moderation and mediation analyses using data pertaining to the prediction of depression. There are, for example, contradictions among studies that investigate mediating and moderating effects of anxiety, stress, self-esteem, and affect on depression. Depression, anxiety and stress are suggested to influence individuals' social relations and activities, work, and studies, as well as compromising decision-making and coping strategies [4], [5], [6]. Successfully coping with anxiety, depressiveness, and stressful situations may contribute to high levels of self-esteem and self-confidence, in addition to increasing well-being and psychological and physical health [6]. Thus, it is important to disentangle how these variables are related to each other. However, while some researchers perform mediation analysis with some of the variables mentioned here, other researchers conduct moderation analysis with the same variables. Seldom are both moderation and mediation performed on the same dataset. Before disentangling mediation and moderation effects on depression in the current literature, we briefly present the methodology behind the analysis performed in this study.

Mediation and moderation

Baron and Kenny [1] postulated several criteria for the analysis of a mediating effect: a significant correlation between the independent and the dependent variable, the independent variable must be significantly associated with the mediator, the mediator predicts the dependent variable even when the independent variable is controlled for, and the correlation between the independent and the dependent variable must be eliminated or reduced when the mediator is controlled for. All of the criteria are then tested using the Sobel test, which shows whether the indirect effects are significant or not [1], [7]. A complete mediating effect occurs when the correlation between the independent and the dependent variable is eliminated when the mediator is controlled for [8]. Analyses of mediation can, for example, help researchers to move beyond answering whether high levels of stress lead to high levels of depression. With mediation analysis researchers might instead answer how stress is related to depression.
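
As a rough illustration of these criteria (not the authors' code), the Baron and Kenny steps and a Sobel test can be sketched in R as follows; the data frame df and the variables stress, anxiety, and depression are hypothetical stand-ins for an independent variable, a mediator, and a dependent variable:

```r
# Hypothetical data frame 'df' with stress (IV), anxiety (mediator), and depression (DV)

step1 <- lm(depression ~ stress, data = df)            # IV must predict the DV
step2 <- lm(anxiety ~ stress, data = df)               # IV must predict the mediator
step3 <- lm(depression ~ stress + anxiety, data = df)  # mediator predicts the DV with the IV
                                                       # controlled for; the IV's coefficient
                                                       # should shrink relative to step 1

# Sobel test of the indirect effect (a * b)
a  <- coef(step2)["stress"]
sa <- coef(summary(step2))["stress", "Std. Error"]
b  <- coef(step3)["anxiety"]
sb <- coef(summary(step3))["anxiety", "Std. Error"]

z <- unname((a * b) / sqrt(b^2 * sa^2 + a^2 * sb^2))
c(z = z, p = 2 * pnorm(-abs(z)))
```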

In contrast to mediation, moderation investigates the unique conditions under which two variables are related [3]. The third variable here, the moderator, is not an intermediate variable in the causal sequence from the independent to the dependent variable. For the analysis of moderation effects, the relation between the independent and dependent variable must be different at different levels of the moderator [3]. Moderators are included in the statistical analysis as an interaction term [1]. When analyzing moderating effects, the variables should first be centered (i.e., calculating the mean to become 0 and the standard deviation to become 1) in order to avoid problems with multicollinearity [8]. Moderating effects can be calculated using multiple hierarchical linear regressions whereby main effects are presented in the first step and interactions in the second step [1]. Analysis of moderation, for example, helps researchers to answer when or under which conditions stress is related to depression.
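
A comparable sketch of a moderation analysis in R (again not the authors' code; df, stress, negaffect, and depression are hypothetical names) might look like this:

```r
# Hypothetical data frame 'df' with depression (DV), stress (IV), and negaffect (moderator)

# Center the predictor and moderator to reduce multicollinearity with the interaction term
df$stress_c    <- df$stress    - mean(df$stress)
df$negaffect_c <- df$negaffect - mean(df$negaffect)

# Hierarchical regression: main effects in step 1, the interaction added in step 2
step1 <- lm(depression ~ stress_c + negaffect_c, data = df)
step2 <- lm(depression ~ stress_c + negaffect_c + stress_c:negaffect_c, data = df)

summary(step2)        # a significant interaction term indicates moderation
anova(step1, step2)   # F-test for the change when the interaction is added
```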

Mediation and moderation effects on depression

Cognitive vulnerability models suggest that maladaptive self-schema mirroring helplessness and low self-esteem explain the development and maintenance of depression (for a review see [9] ). These cognitive vulnerability factors become activated by negative life events or negative moods [10] and are suggested to interact with environmental stressors to increase risk for depression and other emotional disorders [11] , [10] . In this line of thinking, the experience of stress, low self-esteem, and negative emotions can cause depression, but also be used to explain how (i.e., mediation) and under which conditions (i.e., moderation) specific variables influence depression.

Using mediational analyses to investigate how cognitive therapy interventions reduced depression, researchers have shown that the intervention reduced anxiety, which in turn was responsible for 91% of the reduction in depression [12]. In the same study, reductions in depression, by the intervention, accounted for only 6% of the reduction in anxiety. Thus, anxiety seems to affect depression more than depression affects anxiety and, together with stress, is both a cause of and a powerful mediator influencing depression (see also [13]). Indeed, there are positive relationships between depression, anxiety and stress in different cultures [14]. Moreover, while some studies show that stress (independent variable) increases anxiety (mediator), which in turn increases depression (dependent variable) [14], other studies show that stress (moderator) interacts with maladaptive self-schemata (independent variable) to increase depression (dependent variable) [15], [16].

The present study

In order to illustrate how mediation and moderation can be used to address different research questions, we first focus our attention on anxiety and stress as mediators of different variables that have previously been shown to be related to depression. Secondly, we use all variables to find which of these variables moderate the effects on depression.

The specific aims of the present study were:

  • To investigate if anxiety mediated the effect of stress, self-esteem, and affect on depression.
  • To investigate if stress mediated the effects of anxiety, self-esteem, and affect on depression.
  • To examine moderation effects between anxiety, stress, self-esteem, and affect on depression.

Ethics statement

This research protocol was approved by the Ethics Committee of the University of Gothenburg and written informed consent was obtained from all the study participants.

Participants

The present study was based upon a sample of 206 participants (males  = 93, females  = 113). All the participants were first year students in different disciplines at two universities in South Sweden. The mean age for the male students was 25.93 years ( SD  = 6.66), and 25.30 years ( SD  = 5.83) for the female students.

In total, 206 questionnaires were distributed to the students, and 202 were completed, giving a total dropout of 1.94%. The dropout comprised three questionnaires that the participants chose not to respond to at all and one that was completed incorrectly. None of these four questionnaires was included in the analyses.

Instruments

Hospital Anxiety and Depression Scale [17].

The Swedish translation of this instrument [18] was used to measure anxiety and depression. The instrument consists of 14 statements (7 of which measure depression and 7 measure anxiety) to which participants are asked to respond with their grade of agreement on a Likert scale (0 to 3). The utility, reliability and validity of the instrument have been shown in multiple studies (e.g., [19]).

Perceived Stress Scale [20] .

The Swedish version [21] of this instrument was used to measure individuals' experience of stress. The instrument consists of 14 statements, which participants rate on a Likert scale (0 = never, 4 = very often). High values indicate that the individual expresses a high degree of stress.

Rosenberg's Self-Esteem Scale [22] .

The Rosenberg's Self-Esteem Scale (Swedish version by Lindwall [23]) consists of 10 statements focusing on general feelings toward the self. Participants are asked to report their grade of agreement on a four-point Likert scale (1 = agree not at all, 4 = agree completely). It is the most widely used instrument for the estimation of self-esteem, with high levels of reliability and validity (e.g., [24], [25]).

Positive Affect and Negative Affect Schedule [26] .

This is a widely applied instrument for measuring individuals' self-reported mood and feelings. The Swedish version has been used among participants of different ages and occupations (e.g., [27], [28], [29]). The instrument consists of 20 adjectives, 10 describing positive affect (e.g., proud, strong) and 10 describing negative affect (e.g., afraid, irritable). The adjectives are rated on a five-point Likert scale (1 = not at all, 5 = very much). The instrument is a reliable, valid, and effective self-report instrument for estimating these two important and independent aspects of mood [26].

Questionnaires were distributed to the participants at several different locations within the university, including the library and lecture halls. Participants were asked to complete the questionnaire after being informed about the purpose and duration (10–15 minutes) of the study. Participants were also assured complete anonymity and informed that they could end their participation whenever they liked.

Correlational analysis

Depression showed positive, significant relationships with anxiety, stress and negative affect. Table 1 presents the correlation coefficients, mean values and standard deviations (SD), as well as Cronbach's α, for all the variables in the study.

Table 1. https://doi.org/10.1371/journal.pone.0073265.t001

Mediation analysis

Regression analyses were performed in order to investigate if anxiety mediated the effect of stress, self-esteem, and affect on depression (aim 1). The first regression showed that stress (B = .03, 95% CI [.02, .05], β = .36, t = 4.32, p < .001), self-esteem (B = −.03, 95% CI [−.05, −.01], β = −.24, t = −3.20, p < .001), and positive affect (B = −.02, 95% CI [−.05, −.01], β = −.19, t = −2.93, p = .004) each had a unique effect on depression. Surprisingly, negative affect did not predict depression (p = 0.77) and was therefore removed from the mediation model and not included in further analysis.

The second regression tested whether stress, self-esteem and positive affect uniquely predicted the mediator (i.e., anxiety). Stress was positively associated with anxiety (B = .21, 95% CI [.15, .27], β = .47, t = 7.35, p < .001), whereas self-esteem was negatively associated with anxiety (B = −.29, 95% CI [−.38, −.21], β = −.42, t = −6.48, p < .001). Positive affect, however, was not associated with anxiety (p = .50) and was therefore removed from further analysis.

A hierarchical regression analysis using depression as the outcome variable was performed using stress and self-esteem as predictors in the first step, and anxiety as the predictor in the second step. This analysis allows the examination of whether stress and self-esteem predict depression and whether this relation is weakened in the presence of anxiety as the mediator. The results indicated that, in the first step, both stress (B = .04, 95% CI [.03, .05], β = .45, t = 6.43, p < .001) and self-esteem (B = .04, 95% CI [.03, .05], β = .45, t = 6.43, p < .001) predicted depression. When anxiety (i.e., the mediator) was controlled for, predictability was reduced somewhat but was still significant for stress (B = .03, 95% CI [.02, .04], β = .33, t = 4.29, p < .001) and for self-esteem (B = −.03, 95% CI [−.05, −.01], β = −.20, t = −2.62, p = .009). Anxiety, as the mediator, predicted depression even when both stress and self-esteem were controlled for (B = .05, 95% CI [.02, .08], β = .26, t = 3.17, p = .002). Anxiety improved the prediction of depression over and above the independent variables (i.e., stress and self-esteem) (ΔR² = .03, F(1, 198) = 10.06, p = .002). See Table 2 for the details.

Table 2. https://doi.org/10.1371/journal.pone.0073265.t002

A Sobel test was conducted to test the mediating criteria and to assess whether the indirect effects were significant or not. The result showed that the complete pathway from stress (independent variable) to anxiety (mediator) to depression (dependent variable) was significant (z = 2.89, p = .003). The complete pathway from self-esteem (independent variable) to anxiety (mediator) to depression (dependent variable) was also significant (z = 2.82, p = .004). This indicates that anxiety partially mediates the effects of both stress and self-esteem on depression. The result may also indicate that both stress and self-esteem contribute directly to explaining the variation in depression and indirectly via the experienced level of anxiety (see Figure 1).

Figure 1. Changes in Beta weights when the mediator is present are highlighted in red. https://doi.org/10.1371/journal.pone.0073265.g001

For the second aim, regression analyses were performed in order to test if stress mediated the effect of anxiety, self-esteem, and affect on depression. The first regression showed that anxiety ( B  = .07, 95% CI [.04,.10], β = .37, t  = 4.57, p <.001), self-esteem ( B  = −.02, 95% CI [−.05, −.01], β = −.18, t  = −2.23, p  = .03), and positive affect ( B  = −.03, 95% CI [−.04, −.02], β = −.27, t  = −4.35, p <.001) predicted depression independently of each other. Negative affect did not predict depression ( p  = 0.74) and was therefore removed from further analysis.

The second regression investigated whether anxiety, self-esteem and positive affect uniquely predicted the mediator (i.e., stress). Stress was positively associated with anxiety (B = 1.01, 95% CI [.75, 1.30], β = .46, t = 7.35, p < .001), negatively associated with self-esteem (B = −.30, 95% CI [−.50, −.01], β = −.19, t = −2.90, p = .004), and negatively associated with positive affect (B = −.33, 95% CI [−.46, −.20], β = −.27, t = −5.02, p < .001).

A hierarchical regression analysis using depression as the outcome and anxiety, self-esteem, and positive affect as the predictors in the first step, and stress as the predictor in the second step, allowed the examination of whether anxiety, self-esteem and positive affect predicted depression and whether this association would weaken when stress (i.e., the mediator) was present. In the first step of the regression, anxiety (B = .07, 95% CI [.05, .10], β = .38, t = 5.31, p = .02), self-esteem (B = −.03, 95% CI [−.05, −.01], β = −.18, t = −2.41, p = .02), and positive affect (B = −.03, 95% CI [−.04, −.02], β = −.27, t = −4.36, p < .001) significantly explained depression. When stress (i.e., the mediator) was controlled for, predictability was reduced somewhat but was still significant for anxiety (B = .05, 95% CI [.02, .08], β = .05, t = 4.29, p < .001) and for positive affect (B = −.02, 95% CI [−.04, −.01], β = −.20, t = −3.16, p = .002), whereas self-esteem did not reach significance (p = .08). In the second step, the mediator (i.e., stress) predicted depression even when anxiety, self-esteem, and positive affect were controlled for (B = .02, 95% CI [.08, .04], β = .25, t = 3.07, p = .002). Stress improved the prediction of depression over and above the independent variables (i.e., anxiety, self-esteem and positive affect) (ΔR² = .02, F(1, 197) = 9.40, p = .002). See Table 3 for the details.

Table 3. https://doi.org/10.1371/journal.pone.0073265.t003

Furthermore, the Sobel test indicated that the complete pathways from the independent variables (anxiety: z = 2.81, p = .004; self-esteem: z = 2.05, p = .04; positive affect: z = 2.58, p < .01) to the mediator (i.e., stress) to the outcome (i.e., depression) were significant. These specific results might be explained on the basis that stress partially mediated the effects of both anxiety and positive affect on depression, while stress completely mediated the effects of self-esteem on depression. In other words, anxiety and positive affect contributed directly to explaining the variation in depression and indirectly via the experienced level of stress, whereas self-esteem contributed only indirectly, via the experienced level of stress, to explaining the variation in depression. That is, the effects of stress on depression originate from “its own power” and explained more of the variation in depression than self-esteem did (see Figure 2).

Figure 2. https://doi.org/10.1371/journal.pone.0073265.g002

Moderation analysis

Multiple linear regression analyses were used in order to examine moderation effects between anxiety, stress, self-esteem and affect on depression. The analysis indicated that about 52% of the variation in the dependent variable (i.e., depression) could be explained by the main effects and the interaction effects (R² = .55, adjusted R² = .51, F(55, 186) = 14.87, p < .001). When the variables (dependent and independent) were standardized, both the standardized regression coefficients beta (β) and the unstandardized regression coefficients beta (B) became the same value with regard to the main effects. Three of the main effects were significant and contributed uniquely to high levels of depression: anxiety (B = .26, t = 3.12, p = .002), stress (B = .25, t = 2.86, p = .005), and self-esteem (B = −.17, t = −2.17, p = .03). The main effect of positive affect was also significant and contributed to low levels of depression (B = −.16, t = −2.027, p = .02) (see Figure 3). Furthermore, the results indicated that two moderator effects were significant. These were the interaction between stress and negative affect (B = −.28, β = −.39, t = −2.36, p = .02) (see Figure 4) and the interaction between positive affect and negative affect (B = −.21, β = −.29, t = −2.30, p = .02) (Figure 5).

Figure 3. https://doi.org/10.1371/journal.pone.0073265.g003

Figure 4. Low stress and low negative affect lead to lower levels of depression compared to high stress and high negative affect. https://doi.org/10.1371/journal.pone.0073265.g004

Figure 5. High positive affect and low negative affect lead to lower levels of depression compared to low positive affect and high negative affect. https://doi.org/10.1371/journal.pone.0073265.g005

The results in the present study show that (i) anxiety partially mediated the effects of both stress and self-esteem on depression, (ii) that stress partially mediated the effects of anxiety and positive affect on depression, (iii) that stress completely mediated the effects of self-esteem on depression, and (iv) that there was a significant interaction between stress and negative affect, and positive affect and negative affect on depression.

Mediating effects

The study suggests that anxiety contributes directly to explaining the variance in depression, while stress and self-esteem might contribute directly to explaining the variance in depression and indirectly by increasing feelings of anxiety. Indeed, individuals who experience stress over a long period of time are susceptible to increased anxiety and depression [30], [31], and previous research shows that high self-esteem seems to buffer against anxiety and depression [32], [33]. The study also showed that stress partially mediated the effects of both anxiety and positive affect on depression and that stress completely mediated the effects of self-esteem on depression. Anxiety and positive affect contributed directly to explaining the variation in depression and indirectly via the experienced level of stress. Self-esteem contributed only indirectly, via the experienced level of stress, to explaining the variation in depression; that is, stress affects depression on the basis of ‘its own power’ and explains much more of the variation in depressive experiences than self-esteem. In general, individuals who experience low anxiety and frequently experience positive affect seem to experience low stress, which might reduce their levels of depression. Academic stress, for instance, may increase the risk of experiencing depression among students [34]. Although self-esteem did not emerge as an important variable here, under circumstances in which difficulties in life become chronic, some researchers suggest that low self-esteem facilitates the experience of stress [35].

Moderator effects/interaction effects

The present study showed that the interaction between stress and negative affect and between positive and negative affect influenced self-reported depression symptoms. Moderation effects between stress and negative affect imply that students experiencing low levels of stress and low negative affect reported lower levels of depression than those who experienced high levels of stress and high negative affect. This result confirms earlier findings that underline the strong positive association between negative affect and both stress and depression [36], [37]. Nevertheless, negative affect by itself did not predict depression. In this regard, it is important to point out that the absence of positive emotions is a better predictor of morbidity than the presence of negative emotions [38], [39]. A modification to this statement, as illustrated by the results discussed next, could be that the presence of negative emotions in conjunction with the absence of positive emotions increases morbidity.

The moderating effects between positive and negative affect on the experience of depression imply that the students experiencing high levels of positive affect and low levels of negative affect reported lower levels of depression than those who experience low levels of positive affect and high levels of negative affect. This result fits previous observations indicating that different combinations of these affect dimensions are related to different measures of physical and mental health and well-being, such as, blood pressure, depression, quality of sleep, anxiety, life satisfaction, psychological well-being, and self-regulation [40] – [51] .

Limitations

The results indicated a relatively low mean value for depression (M = 3.69), perhaps because the studied population was university students. This might limit the generalizability of the results and might also explain why negative affect, commonly associated with depression, was not related to depression in the present study. Moreover, there is a potential influence of single-source/single-method variance on the findings, especially given the high correlations between all the variables under examination.

Conclusions

The present study highlights the different results that can be arrived at depending on whether researchers decide to use variables as mediators or moderators. For example, when using mediational analyses, anxiety and stress seem to be important factors that explain how the different variables used here influence depression: increases in anxiety and stress by any other factor seem to lead to increases in depression. In contrast, when moderation analyses were used, the interaction of stress and affect predicted depression, and the interaction of both affectivity dimensions (i.e., positive and negative affect) also predicted depression. Stress might increase depression under the condition that the individual is high in negative affectivity; in turn, negative affectivity might increase depression under the condition that the individual experiences low positive affectivity.

Acknowledgments

The authors would like to thank the reviewers for their openness and suggestions, which significantly improved the article.

Author Contributions

Conceived and designed the experiments: AAN TA. Performed the experiments: AAN. Analyzed the data: AAN DG. Contributed reagents/materials/analysis tools: AAN TA DG. Wrote the paper: AAN PR TA DG.

  • 3. MacKinnon DP, Luecken LJ (2008) How and for Whom? Mediation and Moderation in Health Psychology. Health Psychol 27 (2 Suppl.): s99–s102.
  • 4. Aaroe R (2006) Vinn över din depression [Defeat depression]. Stockholm: Liber.
  • 5. Agerberg M (1998) Ut ur mörkret [Out from the Darkness]. Stockholm: Nordstedt.
  • 6. Gilbert P (2005) Hantera din depression [Cope with your Depression]. Stockholm: Bokförlaget Prisma.
  • 8. Tabachnick BG, Fidell LS (2007) Using Multivariate Statistics, Fifth Edition. Boston: Pearson Education, Inc.
  • 10. Beck AT (1967) Depression: Causes and treatment. Philadelphia: University of Pennsylvania Press.
  • 21. Eskin M, Parr D (1996) Introducing a Swedish version of an instrument measuring mental stress. Stockholm: Psykologiska institutionen Stockholms Universitet.
  • 22. Rosenberg M (1965) Society and the Adolescent Self-Image. Princeton, NJ: Princeton University Press.
  • 23. Lindwall M (2011) Självkänsla – Bortom populärpsykologi & enkla sanningar [Self-Esteem – Beyond Popular Psychology and Simple Truths]. Lund:Studentlitteratur.
  • 25. Blascovich J, Tomaka J (1991) Measures of self-esteem. In: Robinson JP, Shaver PR, Wrightsman LS (Red.) Measures of personality and social psychological attitudes San Diego: Academic Press. 161–194.
  • 30. Eysenck M (Ed.) (2000) Psychology: an integrated approach. New York: Oxford University Press.
  • 31. Lazarus RS, Folkman S (1984) Stress, Appraisal, and Coping. New York: Springer.
  • 32. Johnson M (2003) Självkänsla och anpassning [Self-esteem and Adaptation]. Lund: Studentlitteratur.
  • 33. Cullberg Weston M (2005) Ditt inre centrum – Om självkänsla, självbild och konturen av ditt själv [Your Inner Centre – About Self-esteem, Self-image and the Contours of Yourself]. Stockholm: Natur och Kultur.
  • 34. Lindén M (1997) Studentens livssituation. Frihet, sårbarhet, kris och utveckling [Students' Life Situation. Freedom, Vulnerability, Crisis and Development]. Uppsala: Studenthälsan.
  • 35. Williams S (1995) Press utan stress ger maximal prestation [Pressure without Stress gives Maximal Performance]. Malmö: Richters förlag.
  • 37. Garcia D, Kerekes N, Andersson-Arntén A–C, Archer T (2012) Temperament, Character, and Adolescents' Depressive Symptoms: Focusing on Affect. Depress Res Treat. DOI:10.1155/2012/925372.
  • 40. Garcia D, Ghiabi B, Moradi S, Siddiqui A, Archer T (2013) The Happy Personality: A Tale of Two Philosophies. In Morris EF, Jackson M-A editors. Psychology of Personality. New York: Nova Science Publishers. 41–59.
  • 41. Schütz E, Nima AA, Sailer U, Andersson-Arntén A–C, Archer T, Garcia D (2013) The affective profiles in the USA: Happiness, depression, life satisfaction, and happiness-increasing strategies. In press.
  • 43. Garcia D, Nima AA, Archer T (2013) Temperament and Character's Relationship to Subjective Well- Being in Salvadorian Adolescents and Young Adults. In press.
  • 44. Garcia D (2013) La vie en Rose: High Levels of Well-Being and Events Inside and Outside Autobiographical Memory. J Happiness Stud. DOI: 10.1007/s10902-013-9443-x.
  • 48. Adrianson L, Djumaludin A, Neila R, Archer T (2013) Cultural influences upon health, affect, self-esteem and impulsiveness: An Indonesian-Swedish comparison. Int J Res Stud Psychol. DOI: 10.5861/ijrsp.2013.228.


Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.

To get started, let's read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. The data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:

  • GEN, gender (male = 0, female = 1)
  • AMT, amount of drug taken at time of overdose
  • PR, PR wave measurement
  • DIAP, diastolic blood pressure
  • QRS, QRS wave measurement

We'll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.
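
A sketch of that step; the file name amitriptyline.csv and the object name ami_data are assumptions (the original post supplies the data and its location directly):

```r
# Assumed: the 17 overdose records are saved locally as "amitriptyline.csv" without a header row
ami_data <- read.csv("amitriptyline.csv", header = FALSE)
names(ami_data) <- c("TOT", "AMI", "GEN", "AMT", "PR", "DIAP", "QRS")
```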

Before going further you may wish to explore the data using the summary() and pairs() functions.

Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and "binds" them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition but rather inclusion. Taken together the formula cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS translates to "model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS." To fit this model we use the workhorse lm() function and save it to an object we name "mlm1". Finally we view the results with summary() .
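
Using the assumed ami_data object from above, the model just described can be fit and summarised like so:

```r
mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)
```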

Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if we modeled each separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They're identical.
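
For instance, again assuming the data frame is named ami_data:

```r
summary(lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
summary(lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data))
```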

The same diagnostics we check for models with one predictor should be checked for these as well. For a review of some basic but essential diagnostics see our post Understanding Diagnostic Plots for Linear Regression Analysis .

We can use R's extractor functions with our mlm1 object, except we'll get double the output. For example, instead of one set of residuals, we get two:

Instead of one set of fitted values, we get two:

Instead of one set of coefficients, we get two:

Instead of one residual standard error, we get two:
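
A sketch of those four extractor calls on the mlm1 object (all base R; output omitted):

```r
resid(mlm1)    # residuals: a 17 x 2 matrix, one column per response
fitted(mlm1)   # fitted values: likewise one column per response
coef(mlm1)     # coefficients: one column of estimates for TOT and one for AMI
sapply(summary(mlm1), function(s) s$sigma)   # residual standard error for each response
```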

Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don't reproduce the output here because of the size, but we encourage you to view it for yourself:
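
That is, with the base R vcov() extractor:

```r
vcov(mlm1)   # covariance matrix of the coefficients from both models
```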

The main takeaway is that the coefficients from both models covary . That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it's not enough to eyeball the results from the two separate regressions. We should formally test for their inclusion. And that test involves the covariances between the coefficients in both models.

Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term "multivariate" here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.

The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:
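
A sketch of that call (the car package needs to be installed; note the capital A in Anova()):

```r
library(car)
Anova(mlm1)
```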

The results are titled "Type II MANOVA Tests". The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. "Type II" refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.

Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function, (notice the little "a"; this is different from the Anova() function in the car package). For example, below we create a new model using the update() function that only includes GEN and AMT. The expression . ~ . - PR - DIAP - QRS says "keep the same responses and predictors except PR, DIAP and QRS."
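
A sketch of those two steps:

```r
mlm2 <- update(mlm1, . ~ . - PR - DIAP - QRS)
anova(mlm1, mlm2)   # multivariate comparison; reports the Pillai statistic by default
```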

The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is "Pillai", which is one of the four common multivariate test statistics.

The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.
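
A sketch of that test; the object name lh.out matches the one referred to below:

```r
lh.out <- linearHypothesis(mlm1, c("PR = 0", "DIAP = 0", "QRS = 0"))
lh.out
```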

The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).

Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is $$ \frac{\begin{vmatrix}\bf{E}\end{vmatrix}}{\begin{vmatrix}\bf{E} + \bf{H}\end{vmatrix}} $$

In R we can calculate that as follows:
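
A sketch, pulling E and H out of the lh.out object described above:

```r
E <- lh.out$SSPE
H <- lh.out$SSPH
det(E) / det(E + H)   # Wilks
```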

Likewise the formula for Pillai is $$ tr[\bf{H}(\bf{H} + \bf{E})^{-1}] $$ tr means trace. That's the sum of the diagonal elements of a matrix. In R we can calculate as follows:
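
Continuing with the H and E matrices defined above:

```r
sum(diag(H %*% solve(H + E)))   # Pillai
```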

The formula for Hotelling-Lawley is $$ tr[\bf{H}\bf{E}^{-1}] $$ In R:
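
Again with the same matrices:

```r
sum(diag(H %*% solve(E)))   # Hotelling-Lawley
```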

And finally the Roy statistic is the largest eigenvalue of \(\bf{H}\bf{E}^{-1}\). In R code:
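
And, under the same setup:

```r
max(eigen(H %*% solve(E))$values)   # Roy's largest root
```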

Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary:
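
That is, for the mlm2 object created earlier:

```r
summary(mlm2)
```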

Now let's say we wanted to use this model to estimate mean TOT and AMI values for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need to put our new data into a data frame with column names that match our original data.
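
A sketch of that prediction; the data frame name nd is an arbitrary choice:

```r
nd <- data.frame(GEN = 1, AMT = 1200)
predict(mlm2, newdata = nd)   # one predicted value for TOT and one for AMI
```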

This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don't calculate an interval but rather an ellipse to capture the uncertainty in two dimensions.

Unfortunately, at the time of this writing there doesn't appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However, we have written one below that you can use, called confidenceEllipse(). The details of the function go beyond a "getting started" blog post, but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict(). Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics.

Here's a demonstration of the function.

Plot of predicted value for TOT and AMI for model mlm2 with a 95% confidence ellipse.

The dot in the center is our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We're 95% confident the true mean values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa.

  • Fox, J and Weisberg, S (2011). An {R} Companion to Applied Regression, Second Edition . Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
  • Johnson, R and Wichern, D (2007). Applied Multivariate Statistical Analysis, Sixth Edition . Prentice-Hall.
  • Rudorfer, MV "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology , 19 (1982), 67-71.

Clay Ford, Statistical Research Consultant, University of Virginia Library. October 27, 2017. Updated May 26, 2023; updated February 20, 2024 (changed function name).

For questions or clarifications regarding this article, contact  [email protected] .


Multiple Regression Analysis using SPSS Statistics

Introduction.

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.

Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.

This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.

SPSS Statistics

Assumptions.

When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions:

  • Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable . If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
  • Assumption #2: You have two or more independent variables , which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). For examples of continuous and ordinal variables , see the bullet above. Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Again, you can learn more about variables in our article: Types of Variable . If one of your independent variables is dichotomous and considered a moderating variable, you might need to run a Dichotomous moderator analysis .
  • Assumption #3: You should have independence of observations (i.e., independence of residuals ), which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required, in our enhanced multiple regression guide.
  • Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively . Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots and partial regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots to check for linearity. If the relationship displayed in your scatterplots and partial regression plots are not linear, you will have to either run a non-linear regression analysis or "transform" your data, which you can do using SPSS Statistics. In our enhanced multiple regression guide, we show you how to: (a) create scatterplots and partial regression plots to check for linearity when carrying out multiple regression using SPSS Statistics; (b) interpret different scatterplot and partial regression plot results; and (c) transform your data using SPSS Statistics if you do not have linear relationships between your variables.
  • Assumption #5: Your data needs to show homoscedasticity , which is where the variances along the line of best fit remain similar as you move along the line. We explain more about what this means and how to assess the homoscedasticity of your data in our enhanced multiple regression guide. When you analyse your own data, you will need to plot the studentized residuals against the unstandardized predicted values. In our enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity using SPSS Statistics; (b) some of the things you will need to consider when interpreting your data; and (c) possible ways to continue with your analysis if your data fails to meet this assumption.
  • Assumption #6: Your data must not show multicollinearity , which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. Therefore, in our enhanced multiple regression guide, we show you: (a) how to use SPSS Statistics to detect for multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret these correlation coefficients and Tolerance/VIF values so that you can determine whether your data meets or violates this assumption.
  • Assumption #7: There should be no significant outliers , high leverage points or highly influential points . Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your data, you can detect possible outliers, high leverage points and highly influential points. In our enhanced multiple regression guide, we: (a) show you how to detect outliers using "casewise diagnostics" and "studentized deleted residuals", which you can do using SPSS Statistics, and discuss some of the options you have in order to deal with outliers; (b) check for leverage points using SPSS Statistics and discuss what you should do if you have any; and (c) check for influential points in SPSS Statistics using a measure of influence known as Cook's Distance, before presenting some practical approaches in SPSS Statistics to deal with any influential points you might have.
  • Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using: (a) a histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving on to assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.
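If you ever need to screen one of these assumptions outside SPSS Statistics, the sketch below shows one common way to check assumption #6 (multicollinearity) in Python using variance inflation factors. This is not part of the SPSS procedure described in this guide; the function name, file name and column names are hypothetical, and the VIF rule of thumb quoted in the comments is a convention rather than a hard cut-off.

```python
# A minimal sketch (not part of the SPSS guide) of screening multicollinearity
# outside SPSS using Tolerance/VIF values. Column and file names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df, predictors):
    """Return a Tolerance/VIF table for the given predictor columns."""
    X = sm.add_constant(df[predictors])          # add an intercept column
    rows = []
    for i, name in enumerate(X.columns):
        if name == "const":
            continue                             # skip the intercept itself
        vif = variance_inflation_factor(X.values, i)
        rows.append({"predictor": name, "VIF": vif, "Tolerance": 1.0 / vif})
    return pd.DataFrame(rows)

# Example usage with a hypothetical data file:
# data = pd.read_csv("my_study.csv")
# print(vif_table(data, ["age", "weight", "heart_rate", "gender"]))
# As a common rule of thumb, VIF values well above 10 (Tolerance below 0.1)
# suggest problematic multicollinearity.
```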

In the Procedure section below, we illustrate the SPSS Statistics procedure for performing a multiple regression, assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A health researcher wants to be able to predict "VO2 max", an indicator of fitness and health. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off individuals who are not very active/fit and those who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO2 max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO2 max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower-workload cycling test. The researcher's goal is to be able to predict VO2 max based on these four attributes: age, weight, heart rate and gender.

Setup in SPSS Statistics

In SPSS Statistics, we created six variables: (1) VO2max, which is the maximal aerobic capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight (technically, it is their 'mass'); (4) heart_rate, which is the participant's heart rate; (5) gender, which is the participant's gender; and (6) caseno, which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternatively, see our generic, "quick start" guide: Entering Data in SPSS Statistics.

Test Procedure in SPSS Statistics

The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to the interface called "SPSS Light", replacing the previous look for version 26 and earlier, which was called "SPSS Standard". Therefore, if you have SPSS Statistics version 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical.

Menu for a multiple regression analysis in SPSS Statistics

Published with written permission from SPSS Statistics, IBM Corporation.

Note: Don't worry that you're selecting Analyze > Regression > Linear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression. You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.

'Linear Regression' dialogue box for a multiple regression analysis in SPSS Statistics. All variables on the left

Interpreting and Reporting the Output of Multiple Regression Analysis

SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.

However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:

Determining how well the model fits

The first table of interest is the Model Summary table. This table provides the R, R², adjusted R², and the standard error of the estimate, which can be used to determine how well a regression model fits the data:

'Model Summary' table for a multiple regression analysis in SPSS. 'R', 'R Square' & 'Adjusted R Square' highlighted

The " R " column represents the value of R , the multiple correlation coefficient . R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max . A value of 0.760, in this example, indicates a good level of prediction. The " R Square " column represents the R 2 value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO 2 max . However, you also need to be able to interpret " Adjusted R Square " ( adj. R 2 ) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.

Statistical significance

The F-ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F(4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit of the data).

'ANOVA' table for a multiple regression analysis in SPSS Statistics. 'df', 'F' & 'Sig.' highlighted

Estimated model coefficients

The general form of the equation to predict VO2 max from age, weight, heart_rate and gender is:

predicted VO2 max = 87.83 − (0.165 × age) − (0.385 × weight) − (0.118 × heart_rate) + (13.208 × gender)

This is obtained from the Coefficients table, as shown below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 'Unstandardized Coefficients B' highlighted

Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B1, for age is equal to -0.165 (see the Coefficients table). This means that for each one-year increase in age, there is a decrease in VO2 max of 0.165 ml/min/kg.
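To illustrate how this equation is used in practice (this is not part of the SPSS output), the short sketch below plugs a hypothetical participant into the coefficients reported above. The participant's values, and the assumption that gender is coded 1 for this person, are invented purely for the example.

```python
# Minimal sketch: predict VO2 max from the coefficients reported above.
# The participant's values are invented for illustration only.
def predict_vo2max(age, weight, heart_rate, gender):
    """Apply the fitted multiple regression equation from the guide."""
    return 87.83 - 0.165 * age - 0.385 * weight - 0.118 * heart_rate + 13.208 * gender

# Hypothetical participant: 30 years old, 70 kg, heart rate 130, gender coded 1.
print(round(predict_vo2max(30, 70, 130, 1), 2))  # roughly 53.8 ml/min/kg
```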

Statistical significance of the independent variables

You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t-value and corresponding p-value are located in the "t" and "Sig." columns, respectively, as highlighted below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 't' & 'Sig.' highlighted

You can see from the " Sig. " column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B 0 , is tested for statistical significance, this is rarely an important or interesting finding.

Putting it all together

You could write up the results as follows:

A multiple regression was run to predict VO2 max from gender, age, weight and heart rate. These variables statistically significantly predicted VO2 max, F(4, 95) = 32.393, p < .0005, R² = .577. All four variables added statistically significantly to the prediction, p < .05.

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.


Multiple Linear Regression (MLR) Definition, Formula, and Example


What Is Multiple Linear Regression (MLR)?

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.

Key Takeaways

  • Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
  • Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
  • MLR is used extensively in econometrics and financial inference.

Formula and Calculation of Multiple Linear Regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$$

where, for i = 1, ..., n observations:

  • y_i = dependent variable
  • x_i = explanatory variables
  • β_0 = y-intercept (constant term)
  • β_p = slope coefficients for each explanatory variable
  • ϵ = the model's error term (also known as the residuals)
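To make the formula concrete, here is a minimal sketch that estimates the β coefficients by ordinary least squares on a small simulated data set. It is a toy illustration of the equation above, not an analysis of any data discussed in this article; the "true" coefficients and the random data are invented.

```python
# Toy illustration of estimating the beta coefficients in y = b0 + b1*x1 + b2*x2 + e
# by ordinary least squares, using numpy's least-squares solver.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# "True" coefficients used only to simulate the data.
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])        # design matrix with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares estimates
print(beta)  # approximately [2.0, 1.5, -0.7]
```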

What Multiple Linear Regression Can Tell You

Simple linear regression is a function that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Linear regression can only be used when one has two continuous variables—an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends to several explanatory variables.

The multiple regression model is based on the following assumptions:

  • There is a linear relationship between the dependent variable and the independent variables
  • The independent variables are not too highly correlated with each other
  • y_i observations are selected independently and randomly from the population
  • Residuals should be normally distributed with a mean of 0 and constant variance σ²

The coefficient of determination (R²) is a statistical metric that is used to measure how much of the variation in the outcome can be explained by the variation in the independent variables. R² always increases as more predictors are added to the MLR model, even though the predictors may not be related to the outcome variable.

R² by itself therefore can't be used to identify which predictors should be included in a model and which should be excluded. R² can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.

When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation, or vertically in table form.

Example of How to Use Multiple Linear Regression

As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil (XOM). In this case, their linear equation will have the value of the S&P 500 index as the independent variable, or predictor, and the price of XOM as the dependent variable.

In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for example, depends on more than just the performance of the overall market. Other predictors such as the price of oil, interest rates, and the price movement of oil futures can affect the price of ExxonMobil (XOM) and the stock prices of other oil companies. To understand a relationship in which more than two variables are present, multiple linear regression is used.

Multiple linear regression (MLR) is used to determine a mathematical relationship among several random variables. In other terms, MLR examines how multiple independent variables are related to one dependent variable. Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction on the level of effect they have on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.

Referring to the MLR equation above, in our example:

  • y i = dependent variable—the price of XOM
  • x i1 = interest rates
  • x i2 = oil price
  • x i3 = value of S&P 500 index
  • x i4 = price of oil futures
  • B 0 = y-intercept at time zero
  • B 1 = regression coefficient that measures a unit change in the dependent variable when x i1 changes - the change in XOM price when interest rates change
  • B 2 = coefficient value that measures a unit change in the dependent variable when x i2 changes—the change in XOM price when oil prices change

The least-squares estimates—B0, B1, B2, ..., Bp—are usually computed by statistical software. Many variables can be included in the regression model, with each independent variable differentiated by a number: 1, 2, 3, 4, ..., p. The multiple regression model allows an analyst to predict an outcome based on information provided on multiple explanatory variables.

Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome predicted by the model. The residual value, E, which is the difference between the actual outcome and the predicted outcome, is included in the model to account for such slight variations.
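As a hedged illustration of how such software computes those estimates, the sketch below fits an MLR model in Python's statsmodels. The predictor names mirror the list above, but every number is simulated; this is not real XOM, interest-rate, oil or S&P 500 data.

```python
# Hedged sketch: fitting an MLR model of a stock price on several predictors.
# All values are simulated; this is not real XOM, oil or S&P 500 data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 250
data = pd.DataFrame({
    "interest_rates": rng.normal(3, 0.5, n),
    "oil_price": rng.normal(70, 10, n),
    "sp500": rng.normal(4000, 200, n),
    "oil_futures": rng.normal(72, 10, n),
})
# Simulated dependent variable (stock price) with added noise.
data["price"] = (20 - 1.5 * data["interest_rates"] + 0.4 * data["oil_price"]
                 + 0.01 * data["sp500"] + 0.1 * data["oil_futures"]
                 + rng.normal(0, 2, n))

X = sm.add_constant(data[["interest_rates", "oil_price", "sp500", "oil_futures"]])
model = sm.OLS(data["price"], X).fit()
print(model.params)      # least-squares estimates B0, B1, ..., Bp
print(model.rsquared)    # proportion of variance explained
```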

Assume we run our XOM price regression model through statistical computation software that returns the following output:

An analyst would interpret this output to mean that, if other variables are held constant, the price of XOM will increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates. R² indicates that 86.5% of the variations in the stock price of ExxonMobil can be explained by changes in the interest rate, oil price, oil futures, and S&P 500 index.

The Difference Between Linear and Multiple Regression

Ordinary least squares (OLS) regression compares the response of a dependent variable given a change in some explanatory variables. However, a dependent variable is rarely explained by only one variable. In this case, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. Multiple regressions can be linear and nonlinear.

Multiple regressions are based on the assumption that there is a linear relationship between both the dependent and independent variables. It also assumes no major correlation between the independent variables.

What Makes a Multiple Regression Multiple?

A multiple regression considers the effect of more than one explanatory variable on some outcome of interest. It evaluates the relative effect of these explanatory, or independent, variables on the dependent variable when holding all the other variables in the model constant.

Why Would One Use a Multiple Regression Over a Simple OLS Regression?

A dependent variable is rarely explained by only one variable. In such cases, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. The model, however, assumes that there are no major correlations between the independent variables.

Can I Do a Multiple Regression by Hand?

It's unlikely as multiple regression models are complex and become even more so when there are more variables included in the model or when the amount of data to analyze grows. To run a multiple regression you will likely need to use specialized statistical software or functions within programs like Excel.

What Does It Mean for a Multiple Regression to Be Linear?

In multiple linear regression, the model calculates the line of best fit that minimizes the sum of the squared differences between the observed values of the dependent variable and the values predicted from the independent variables. Because it fits a line, it is a linear model. There are also non-linear regression models involving multiple variables, such as logistic regression, quadratic regression, and probit models.

How Are Multiple Regression Models Used in Finance?

Any econometric model that looks at more than one variable may be a multiple regression model. Factor models compare two or more factors to analyze relationships between variables and the resulting performance. The Fama and French Three-Factor Model is such a model: it expands on the capital asset pricing model (CAPM) by adding size risk and value risk factors to the market risk factor in CAPM (which is itself a regression model). By including these two additional factors, the model adjusts for the tendency of small-cap and value stocks to outperform, which is thought to make it a better tool for evaluating manager performance.



Multiple Regression Model


We discuss regression analysis with multiple predictor variables, detailing especially the case with two predictor variables.




About this chapter

Toyoda, H. (2024). Multiple Regression Model. In: Statistics with Posterior Probability and a PHC Curve. Springer, Singapore. https://doi.org/10.1007/978-981-97-3094-0_15




SPSS Multiple Linear Regression Example

A newly updated, ad-free video version of this tutorial is included in our SPSS beginners course .

Multiple Regression - Example

  • Data checks and descriptive statistics
  • SPSS regression dialogs
  • SPSS multiple regression output
  • Multiple regression assumptions
  • APA reporting multiple regression

A scientist wants to know if and how health care costs can be predicted from several patient characteristics. All data are in health-costs.sav as shown below.

SPSS Multiple Linear Regression Example Data

Our scientist thinks that each independent variable has a linear relation with health care costs. He therefore decides to fit a multiple linear regression model. The final model will predict costs from all independent variables simultaneously.

Before running multiple regression, first make sure that

  • the dependent variable is quantitative;
  • each independent variable is quantitative or dichotomous;
  • you have sufficient sample size.

A visual inspection of our data shows that requirements 1 and 2 are met: sex is a dichotomous variable and all other relevant variables are quantitative. Regarding sample size, a general rule of thumb is that you want to use at least 15 independent observations for each independent variable you'll include. In our example, we'll use 5 independent variables so we need a sample size of at least N = (5 · 15 =) 75 cases. Our data contain 525 cases so this seems fine.

SPSS Multiple Linear Regression Check Sample Size

Keep in mind, however, that we may not be able to use all N = 525 cases if there are any missing values in our variables.

Let's now proceed with some quick data checks. I strongly encourage you to at least

  • run basic histograms over all variables. Check if their frequency distributions look plausible. Are there any outliers? Should you specify any missing values?
  • inspect a scatterplot for each independent variable (x-axis) versus the dependent variable (y-axis). A handy tool for doing just that is downloadable from SPSS - Create All Scatterplots Tool . Do you see any curvilinear relations or anything unusual?
  • run descriptive statistics over all variables. Inspect if any variables have any missing values and -if so- how many.
  • inspect the Pearson correlations among all variables. Absolute correlations exceeding 0.8 or so may later cause complications (known as multicollinearity) for the actual regression analysis.

The APA recommends you combine and report these last two tables as shown below.

Apa Descriptive Statistics Correlations Table

These data checks show that our example data look perfectly fine: all charts are plausible, there are no missing values and none of the correlations exceed 0.43. Let's now proceed with the actual regression analysis.


Next, we fill out the main dialog and subdialogs as shown below.

SPSS Multiple Linear Regression Dialogs

SPSS Multiple Regression Syntax I

The first table we inspect is the Coefficients table shown below.

SPSS Multiple Regression Coefficients Table

$$Costs' = -3263.6 + 509.3 \cdot Sex + 114.7 \cdot Age + 50.4 \cdot Alcohol\\ + 139.4 \cdot Cigarettes - 271.3 \cdot Exercise$$

where \(Costs'\) denotes predicted yearly health care costs in dollars.

Each b-coefficient indicates the average increase in costs associated with a 1-unit increase in a predictor. For example, a 1-year increase in age results in an average $114.7 increase in costs. Or a 1 hour increase in exercise per week is associated with a -$271.3 increase (that is, a $271.3 decrease) in yearly health costs.

Now, let's talk about sex : a 1-unit increase in sex results in an average $509.3 increase in costs. For understanding what this means, please note that sex is coded 0 (female) and 1 (male) in our example data. So for this variable, the only possible 1-unit increase is from female (0) to male (1). Therefore, B = $509.3 simply means that the average yearly costs for males are $509.3 higher than for females (everything else equal, that is). This hopefully clarifies how dichotomous variables can be used in multiple regression. We'll expand on this idea when we'll cover dummy variables in a later tutorial.


Now, our b-coefficients don't tell us the relative strengths of our predictors. This is because these have different scales: is a cigarette per day more or less than an alcoholic beverage per week? One way to deal with this, is to compare the standardized regression coefficients or beta coefficients, often denoted as β (the Greek letter “beta”). In statistics, β also refers to the probability of committing a type II error in hypothesis testing. This is why (1 - β ) denotes power but that's a completely different topic than regression coefficients.

In our example, the predictors with the largest absolute beta coefficients are:

  • age ( β = 0.322);
  • cigarette consumption ( β = 0.311);
  • exercise ( β = -0.281).

Beta coefficients are obtained by standardizing all regression variables into z-scores before computing b-coefficients. Standardizing variables applies a similar standard (or scale ) to them: the resulting z-scores always have mean of 0 and a standard deviation of 1. This holds regardless whether they're computed over years, cigarettes or alcoholic beverages. So that's why b-coefficients computed over standardized variables -beta coefficients- are comparable within and between regression models.
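For readers who want to reproduce this outside SPSS, the sketch below contrasts b-coefficients with beta coefficients by z-scoring the variables before refitting the model. The data are simulated and the column names only echo this example; the numbers will not match the tutorial's output.

```python
# Sketch: b-coefficients vs. beta (standardized) coefficients on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age": rng.normal(45, 12, n),
    "cigarettes": rng.poisson(5, n).astype(float),
    "exercise": rng.normal(2, 1, n),
})
# Simulated yearly health costs with noise.
df["costs"] = (1000 + 100 * df["age"] + 120 * df["cigarettes"]
               - 250 * df["exercise"] + rng.normal(0, 800, n))

# Ordinary b-coefficients (in the variables' original units).
b = sm.OLS(df["costs"], sm.add_constant(df[["age", "cigarettes", "exercise"]])).fit()

# Beta coefficients: z-score every variable first, then refit.
z = (df - df.mean()) / df.std()
beta = sm.OLS(z["costs"], sm.add_constant(z[["age", "cigarettes", "exercise"]])).fit()

print(b.params)     # depends on each variable's scale
print(beta.params)  # comparable across predictors (intercept is ~0)
```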

Right, so our b-coefficients make up our multiple regression model. This tells us how to predict yearly health care costs. What we don't know, however, is precisely how well does our model predict these costs? We'll find the answer in the model summary table discussed below.

SPSS Regression Output II - Model Summary & ANOVA


Sadly, SPSS doesn't include a confidence interval for the adjusted R². However, the p-value found in the ANOVA table applies to R and R-square (the rest of this table is pretty useless). It evaluates the null hypothesis that our entire regression model has a population R of zero. Since p < 0.05, we reject this null hypothesis for our example data.

It seems we're done for this analysis but we skipped an important step: checking the multiple regression assumptions.

Our data checks started off with some basic requirements. However, the “official” multiple linear regression assumptions are

  • independent observations ;
  • normality : the regression residuals must be normally distributed in the population (strictly speaking, we should distinguish between residuals in the sample and errors in the population, but let's not overcomplicate things here);
  • homoscedasticity : the population variance of the residuals should not fluctuate in any systematic way;
  • linearity : each predictor must have a linear relation with the dependent variable.

We'll check if our example analysis meets these assumptions by doing 3 things:

  • A visual inspection of our data shows that each of our N = 525 observations applies to a different person. Furthermore, these people did not interact in any way that should influence their survey answers. In this case, we usually consider them independent observations.
  • We'll create and inspect a histogram of our regression residuals to see if they are approximately normally distributed .
  • We'll create and inspect a scatterplot of residuals (y-axis) versus predicted values (x-axis). This scatterplot may detect violations of both homoscedasticity and linearity.

The easy way to obtain these 2 regression plots is selecting them in the dialogs (shown below) and rerunning the regression analysis.
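If you would rather build these two residual plots outside SPSS, a minimal Python sketch is shown below. It assumes you have already extracted (or, as here, simulated) the standardized residuals and predicted values; it mimics the general shape of the charts, not SPSS's exact output.

```python
# Sketch: histogram of standardized residuals and residuals-vs-predicted scatterplot.
# Residuals and predicted values are simulated here purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
predicted = rng.normal(2500, 600, 525)           # stand-in for predicted costs
residuals = rng.normal(0, 1, 525)                # stand-in standardized residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(residuals, bins=25, edgecolor="black")  # check approximate normality
ax1.set_title("Histogram of standardized residuals")

ax2.scatter(predicted, residuals, s=10)          # check homoscedasticity/linearity
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted values")
ax2.set_ylabel("Standardized residuals")
ax2.set_title("Residuals vs. predicted")

plt.tight_layout()
plt.show()
```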

SPSS Multiple Regression Plots Subdialog

SPSS Multiple Regression Syntax II

Residual Plots I - Histogram

SPSS Histogram Standardized Regression Residuals

The histogram over our standardized residuals shows

  • a tiny bit of positive skewness ; the right tail of the distribution is stretched out a bit.
  • a tiny bit of positive kurtosis; our distribution is more peaked (or “leptokurtic”) than the normal curve. This is because the bars in the middle are too high and pierce through the normal curve.

In short, we do see some deviations from normality but they're tiny. Most analysts would conclude that the residuals are roughly normally distributed. If you're not convinced, you could add the residuals as a new variable to the data via the SPSS regression dialogs. Next, you could run a Shapiro-Wilk test or a Kolmogorov-Smirnov test on them. However, we don't generally recommend these tests.

Residual Plots II - Scatterplot

The residual scatterplot shown below is often used for checking a) the homoscedasticity and b) the linearity assumptions. If both assumptions hold, this scatterplot shouldn't show any systematic pattern whatsoever. That seems to be the case here.

Regression Plot Residuals Versus Predicted Values

Homoscedasticity implies that the variance of the residuals should be constant. This variance can be estimated from how far the dots in our scatterplot lie apart vertically . Therefore, the height of our scatterplot should neither increase nor decrease as we move from left to right. We don't see any such pattern.

A common check for the linearity assumption is inspecting if the dots in this scatterplot show any kind of curve. That's not the case here, so linearity also seems to hold. On a personal note, however, I find this a very weak approach. An unusual (but much stronger) approach is to fit a variety of non-linear regression models for each predictor separately. Doing so requires very little effort and often reveals non-linearity. This can then be added to some linear model in order to improve its predictive accuracy. Sadly, this “low hanging fruit” is routinely overlooked because analysts usually limit themselves to the poor scatterplot approach that we just discussed.

The APA reporting guidelines propose the table shown below for reporting a standard multiple regression analysis.

Apa Reporting Multiple Linear Regression

I think it's utter stupidity that the APA table doesn't include the constant for our regression model. I recommend you add it anyway. Last, the APA also recommends reporting a combined descriptive statistics and correlations table like we saw here.


By Emma on July 1st, 2023

In this analysis, we only deal with some numerical variables, but categorical variables such as region, marital status, etc. did not participate in the analysis. So why don't we create dummy variables, put the new variables into the analysis, and finally get results involving all relevant independent variables?


By Ruben Geert van den Berg on July 1st, 2023

We didn't do that because the tutorial would become way too long like that.

Besides, multiple regression is often limited to quantitative variables and we don't want to discuss every imaginable sub-topic in a single tutorial.

We discussed the use of dummy variables in SPSS Dummy Variable Regression Tutorial .

Hope that helps!




Understanding and interpreting regression analysis (Volume 24, Issue 4)


  • http://orcid.org/0000-0002-7839-8130 Parveen Ali 1, 2,
  • http://orcid.org/0000-0003-0157-5319 Ahtisham Younas 3, 4
  • 1 School of Nursing and Midwifery, University of Sheffield, Sheffield, South Yorkshire, UK
  • 2 Sheffield University Interpersonal Violence Research Group, The University of Sheffield SEAS, Sheffield, UK
  • 3 Faculty of Nursing, Memorial University of Newfoundland, St. John's, Newfoundland and Labrador, Canada
  • 4 Swat College of Nursing, Mingora, Swat, Pakistan
  • Correspondence to Ahtisham Younas, Memorial University of Newfoundland, St. John's, NL A1C 5S7, Canada; ay6133@mun.ca

https://doi.org/10.1136/ebnurs-2021-103425


  • statistics & research methods

Introduction

A nurse educator is interested in finding out the academic and non-academic predictors of success in nursing students. Given the complexity of educational and clinical learning environments, demographic, clinical and academic factors (age, gender, previous educational training, personal stressors, learning demands, motivation, assignment workload, etc) influencing nursing students’ success, she was able to list various potential factors contributing towards success relatively easily. Nevertheless, not all of the identified factors will be plausible predictors of increased success. Therefore, she could use a powerful statistical procedure called regression analysis to identify whether the likelihood of increased success is influenced by factors such as age, stressors, learning demands, motivation and education.

What is regression?

Purposes of regression analysis

Regression analysis has four primary purposes: description, estimation, prediction and control. 1, 2 By description, regression can explain the relationship between dependent and independent variables. Estimation means that, by using the observed values of the independent variables, the value of the dependent variable can be estimated. 2 Regression analysis can be useful for predicting the outcomes and changes in dependent variables based on the relationships of dependent and independent variables. Finally, regression enables researchers to control the effect of one or more independent variables while investigating the relationship of one independent variable with the dependent variable. 1

Types of regression analyses

There are commonly three types of regression analyses, namely, linear, logistic and multiple regression. The differences among these types are outlined in table 1 in terms of their purpose, nature of dependent and independent variables, underlying assumptions, and nature of curve. 1 , 3 However, more detailed discussion for linear regression is presented as follows.


Comparison of linear, logistic and multiple regression

Linear regression and interpretation

Linear regression analysis involves examining the relationship between one independent and one dependent variable. Statistically, the relationship between one independent variable (x) and a dependent variable (y) is expressed as: y = β0 + β1x + ε. In this equation, β0 is the y intercept and refers to the estimated value of y when x is equal to 0. The coefficient β1 is the regression coefficient and denotes the estimated increase in the dependent variable for every unit increase in the independent variable. The symbol ε is a random error component and signifies the imprecision of regression, indicating that, in actual practice, the independent variables cannot perfectly predict the change in any dependent variable. 1 Multiple linear regression follows the same logic as univariate linear regression, except that (a) in multiple regression there is more than one independent variable and (b) there should be no collinearity among the independent variables.
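Written out in display form, the two models described in this paragraph are:

$$y = \beta_0 + \beta_1 x + \varepsilon \quad \text{(simple linear regression)}$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon \quad \text{(multiple linear regression)}$$

Here k denotes the number of independent variables; the multiple regression equation simply adds one βx term per additional predictor.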

Factors affecting regression

Linear and multiple regression analyses are affected by factors, namely, sample size, missing data and the nature of sample. 2

A small sample size may only demonstrate connections among variables with strong relationships. Therefore, the sample size must be chosen based on the number of independent variables and the expected strength of the relationships.

Many missing values in the data set may affect the sample size. Therefore, all the missing values should be adequately dealt with before conducting regression analyses.

The subsamples within the larger sample may mask the actual effect of independent and dependent variables. Therefore, if subsamples are predefined, a regression within the sample could be used to detect true relationships. Otherwise, the analysis should be undertaken on the whole sample.

Building on her research interest mentioned in the beginning, let us consider a study by Ali and Naylor. 4 They were interested in identifying the academic and non-academic factors which predict the academic success of nursing diploma students. This purpose is consistent with one of the above-mentioned purposes of regression analysis (ie, prediction). Ali and Naylor’s chosen academic independent variables were preadmission qualification, previous academic performance and school type and the non-academic variables were age, gender, marital status and time gap. To achieve their purpose, they collected data from 628 nursing students between the age range of 15–34 years. They used both linear and multiple regression analyses to identify the predictors of student success. For analysis, they examined the relationship of academic and non-academic variables across different years of study and noted that academic factors accounted for 36.6%, 44.3% and 50.4% variability in academic success of students in year 1, year 2 and year 3, respectively. 4

Ali and Naylor presented the relationship among these variables using scatter plots, which are commonly used graphs for data display in regression analysis—see examples of various scatter plots in figure 1. 4 In a scatter plot, the clustering of the dots denotes the strength of the relationship, whereas the direction indicates the nature of the relationship among variables as positive (ie, an increase in one variable results in an increase in the other) or negative (ie, an increase in one variable results in a decrease in the other).


An Example of Scatter Plot for Regression.

Table 2 presents the results of regression analysis for academic and non-academic variables for year 4 students' success. The significant predictors of student success are denoted with a significant p value. For every significant predictor, the beta value indicates the percentage increase in students' academic success with one unit increase in the variable.

Regression model for the final year students (N=343)

Conclusions

Regression analysis is a powerful and useful statistical procedure with many implications for nursing research. It enables researchers to describe, predict and estimate the relationships and draw plausible conclusions about the interrelated variables in relation to any studied phenomena. Regression also allows for controlling one or more variables when researchers are interested in examining the relationship among specific variables. Some of the key considerations are presented that may be useful for researchers undertaking regression analysis. While planning and conducting regression analysis, researchers should consider the type and number of dependent and independent variables as well as the nature and size of sample. Choosing a wrong type of regression analysis with small sample may result in erroneous conclusions about the studied phenomenon.

Ethics statements

Patient consent for publication.

Not required.

Twitter @parveenazamali, @Ahtisham04

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Provenance and peer review Commissioned; internally peer reviewed.



Regression analysis: The ultimate guide


When you rely on data to drive and guide business decisions, as well as predict market trends, just  gathering and analysing  what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between dependent variables and independent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for  analysing different factors  that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also  help leaders understand  how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between  the number of marketers  employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influences sales success. But do you need both factors to close those sales? By analysing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with  variables that are categorised into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analysing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyse and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS)  or  customer satisfaction score (CSAT) .

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when  new product features are rolled out .

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis


So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.

Regression analysis - step by step

This  regression line  is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.

Regression analysis - step by step

Statistical analysis software can draw this line for you and precisely calculate the  regression line.  The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.
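For a simple linear regression, that formula is the standard least-squares solution, stated here for reference (it is not specific to any particular software):

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

where b1 is the slope of the regression line, b0 is the intercept, and x̄ and ȳ are the means of the independent and dependent variables.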

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sample collection method that is representative of the target population
  • the observed relationship between the variables can't be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and dependent variable is linear – meaning that the best fit along the data points is a straight line and not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated in their own right

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organisation wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualise those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyse the data.
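One reasonable way to sketch that kind of model in Python is with scikit-learn, which can fit several dependent variables at once when the target is a two-dimensional array. The figures below are invented, and the framing (wellbeing measures as the outcomes, pandemic facets as the predictors) is just one plausible way to set the example up.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: one row per market/region
# Independent variables: % of employees working at home, lockdown duration in weeks
X = np.array([[20, 4], [35, 8], [50, 10], [65, 14], [80, 20], [90, 24]])

# Dependent variables: mean mental-health self-rating (1-10) and sick days per employee
Y = np.array([[7.8, 3.1], [7.2, 3.6], [6.9, 4.0], [6.1, 4.8], [5.6, 5.5], [5.2, 6.1]])

model = LinearRegression().fit(X, Y)
print(model.coef_)        # one row of coefficients per dependent variable
print(model.intercept_)

# Predict both outcomes for a region with 70% home working and 16 lockdown weeks
print(model.predict([[70, 16]]))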

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s an outcome with only two possible scenarios: either the event happens (1) or it doesn’t (0). Examples include yes/no and pass/fail outcomes. In other words, the outcome can be described as falling into one of exactly two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
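A minimal sketch of that sports example in Python, using scikit-learn's LogisticRegression, might look like the following; the match records are invented and the features are simplified to keep the example short.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented match records: [playing_at_home (1/0), rest_days, opponent_rank]
X = np.array([[1, 5, 12], [0, 3, 4], [1, 6, 8], [0, 4, 15],
              [1, 2, 3], [0, 7, 18], [1, 5, 6], [0, 3, 9]])
y = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # 1 = win, 0 = loss

model = LogisticRegression().fit(X, y)

# Estimated probability of winning the next game:
# playing at home, 4 rest days, opponent ranked 10th
print(model.predict_proba([[1, 4, 10]])[0, 1])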

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyse it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical transformations to accommodate the non-linear data. For example, if you are looking at income data, which tends to follow a skewed, roughly log-normal distribution, you can take the natural log of income as your variable and then transform the model’s predictions back to the original scale (see the sketch after this list).

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. Variables that are causally linked will generally show correlation, but the reverse is not always true. Moreover, no statistic on its own can establish causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyse

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.
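Returning to the income example under “Assumptions” above, the sketch below shows one way to handle that kind of skewed variable in Python: fit the regression on the natural log of income, then transform predictions back to the original scale. The experience and income figures are invented for illustration.

import numpy as np
from scipy import stats

# Invented data: years of experience and annual income (income is right-skewed)
experience = np.array([1, 2, 3, 5, 8, 12, 15, 20])
income = np.array([28000, 31000, 36000, 45000, 62000, 90000, 120000, 180000])

# Fit the line on log(income), where the relationship is closer to linear
log_fit = stats.linregress(experience, np.log(income))

# Predict income at 10 years of experience, then convert back to the original units
predicted_log_income = log_fit.intercept + log_fit.slope * 10
print(np.exp(predicted_log_income))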

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement when it comes to efficiency, whether in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use the initial regression equation to determine how many members of staff and how much equipment they need to meet orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the effect of advertising and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical daily data from an ad account, we can compare ad spend with conversions and see how changes in spend alter the number of conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.
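A hedged sketch of this ad-spend analysis in Python might look like the following, assuming the daily figures have been exported to a CSV file with 'daily_spend' and 'conversions' columns (the file and column names are illustrative, not a real export format).

import pandas as pd
from scipy import stats

# Hypothetical daily export from the ad platform (file and column names are invented)
ads = pd.read_csv("ad_account_daily.csv")

fit = stats.linregress(ads["daily_spend"], ads["conversions"])
print(f"extra conversions per extra unit of spend: {fit.slope:.3f}")
print(f"r-squared: {fit.rvalue**2:.3f}, p-value: {fit.pvalue:.4f}")

# Rough prediction of conversions at a proposed daily budget
proposed_budget = 500
print(fit.intercept + fit.slope * proposed_budget)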

This is an example of a simple linear model. To carry out a more complex regression, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using the estimated effect of each independent variable, we can more accurately predict how changes in spend will change the conversion rate of advertising.

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without requiring formal statistical training.

Stats iQ in action

To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualisations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.

With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.


Institute for Digital Research and Education

What statistical analysis should I use? Statistical analyses using SPSS

Introduction

This page shows how to perform a number of statistical tests using SPSS.  Each section gives a brief description of the aim of the statistical test, when it is used, an example showing the SPSS commands and SPSS (often abbreviated) output with a brief interpretation of the output. You can see the page Choosing the Correct Statistical Test for a table that shows an overview of when each test is appropriate to use.  In deciding which test is appropriate to use, it is important to consider the type of variables that you have (i.e., whether your variables are categorical, ordinal or interval and whether they are normally distributed), see What is the difference between categorical, ordinal and interval variables? for more information on this.

About the hsb data file

Most of the examples in this page will use a data file called hsb2, high school and beyond.  This data file contains 200 observations from a sample of high school students with demographic information about the students, such as their gender ( female ), socio-economic status ( ses ) and ethnic background ( race ). It also contains a number of scores on standardized tests, including tests of reading ( read ), writing ( write ), mathematics ( math ) and social studies ( socst ). You can get the hsb data file by clicking on hsb2 .

One sample t-test

A one sample t-test allows us to test whether a sample mean (of a normally distributed interval variable) significantly differs from a hypothesized value.  For example, using the hsb2 data file , say we wish to test whether the average writing score ( write ) differs significantly from 50.  We can do this as shown below. t-test  /testval = 50  /variable = write. The mean of the variable write for this particular sample of students is 52.775, which is statistically significantly different from the test value of 50.  We would conclude that this group of students has a significantly higher mean on the writing test than 50.
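For readers working outside SPSS, a roughly equivalent test can be run in Python with scipy, assuming the hsb2 data have been exported to a local CSV file (the file name below is an assumption, not part of the original example).

import pandas as pd
from scipy import stats

hsb2 = pd.read_csv("hsb2.csv")   # hypothetical local export of the hsb2 data
t_stat, p_value = stats.ttest_1samp(hsb2["write"], popmean=50)
print(t_stat, p_value)           # should echo the SPSS conclusion that the mean differs from 50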

One sample median test

A one sample median test allows us to test whether a sample median differs significantly from a hypothesized value.  We will use the same variable, write , as we did in the one sample t-test example above, but we do not need to assume that it is interval and normally distributed (we only need to assume that write is an ordinal variable). nptests /onesample test (write) wilcoxon(testvalue = 50).

Binomial test

A one sample binomial test allows us to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value.  For example, using the hsb2 data file , say we wish to test whether the proportion of females ( female ) differs significantly from 50%, i.e., from .5.  We can do this as shown below. npar tests  /binomial (.5) = female. The results indicate that there is no statistically significant difference (p = .229).  In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.

Chi-square goodness of fit

A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions.  For example, let’s suppose that we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White folks.  We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions. npar test   /chisquare = race  /expected = 10 10 10 70. These results show that racial composition in our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).

Two independent samples t-test

An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups.  For example, using the hsb2 data file , say we wish to test whether the mean for write is the same for males and females. t-test groups = female(0 1)   /variables = write. Because the standard deviations for the two groups are similar (10.3 and 8.1), we will use the “equal variances assumed” test.  The results indicate that there is a statistically significant difference between the mean writing score for males and females (t = -3.734, p = .000).  In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12). See also SPSS Learning Module: An overview of statistical tests in SPSS

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal).  You will notice that the SPSS syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test.  We will use the same data file (the hsb2 data file ) and the same variables in this example as we did in the independent t-test example above and will not assume that write , our dependent variable, is normally distributed.
npar test /m-w = write by female(0 1). The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001). See also FAQ: Why is the Mann-Whitney significant when the medians are equal?

Chi-square test

A chi-square test is used when you want to see if there is a relationship between two categorical variables.  In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command to obtain the test statistic and its associated p-value.  Using the hsb2 data file , let’s see if there is a relationship between the type of school attended ( schtyp ) and students’ gender ( female ).  Remember that the chi-square test assumes that the expected value for each cell is five or higher. This assumption is easily met in the examples below.  However, if this assumption is not met in your data, please see the section on Fisher’s exact test below. crosstabs /tables = schtyp by female /statistic = chisq. These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828). Let’s look at another example, this time looking at the linear relationship between gender ( female ) and socio-economic status ( ses ).  The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels.  In this example, female has two levels (male and female) and ses has three levels (low, medium and high). crosstabs /tables = female by ses /statistic = chisq. Again we find that there is no statistically significant relationship between the variables (chi-square with two degrees of freedom = 4.577, p = 0.101). See also SPSS Learning Module: An Overview of Statistical Tests in SPSS

Fisher’s exact test

The Fisher’s exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less.  Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher’s exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS unless you have the SPSS Exact Test Module, you can only perform a Fisher’s exact test on a 2×2 table, and these results are presented by default.  Please see the results from the chi squared example above.

One-way ANOVA

A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.  For example, using the hsb2 data file , say we wish to test whether the mean of write differs between the three program types ( prog ).  The command for this test would be: oneway write by prog. The mean of the dependent variable differs significantly among the levels of program type.  However, we do not know if the difference is between only two of the levels or all three of the levels.  (The F test for the Model is the same as the F test for prog because prog was the only variable entered into the model.  If other variables had also been entered, the F test for the Model would have been different from prog .)  To see the mean of write for each level of program type, means tables = write by prog. From this we can see that the students in the academic program have the highest mean writing score, while students in the vocational program have the lowest. See also SPSS Textbook Examples: Design and Analysis, Chapter 7 SPSS Textbook Examples: Applied Regression Analysis, Chapter 8 SPSS FAQ: How can I do ANOVA contrasts in SPSS? SPSS Library: Understanding and Interpreting Parameter Estimates in Regression and ANOVA

Kruskal Wallis test

The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test method since it permits two or more groups.  We will use the same data file as the one way ANOVA example above (the hsb2 data file ) and the same variables as in the example above, but we will not assume that write is a normally distributed interval variable. npar tests /k-w = write by prog (1,3). If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-squared.  With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.

Paired t-test

A paired (samples) t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means on these two normally distributed interval variables differ from one another.  For example, using the hsb2 data file we will test whether the mean of read is equal to the mean of write . t-test pairs = read with write (paired). These results indicate that the mean of read is not statistically significantly different from the mean of write (t = -0.867, p = 0.387).

Wilcoxon signed rank sum test

The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test.  You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed. npar test /wilcoxon = write with read (paired). The results suggest that there is not a statistically significant difference between read and write . If you believe the differences between read and write were not ordinal but could merely be classified as positive and negative, then you may want to consider a sign test in lieu of sign rank test.  Again, we will use the same variables in this example and assume that this difference is not ordinal. npar test /sign = read with write (paired). We conclude that no statistically significant difference was found (p=.556).

McNemar test

You would perform McNemar’s test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group.  Continuing with the hsb2 dataset used in several above examples, let us create two binary outcomes in our dataset: himath and hiread . These outcomes can be considered in a two-way contingency table.  The null hypothesis is that the proportion of students in the himath group is the same as the proportion of students in hiread group (i.e., that the contingency table is symmetric). compute himath = (math>60). compute hiread = (read>60). execute. crosstabs /tables=himath BY hiread /statistic=mcnemar /cells=count. McNemar’s chi-square statistic suggests that there is not a statistically significant difference in the proportion of students in the himath group and the proportion of students in the hiread group.

One-way repeated measures ANOVA

You would perform a one-way repeated measures analysis of variance if you had one categorical independent variable and a normally distributed interval dependent variable that was repeated at least twice for each subject.  This is the equivalent of the paired samples t-test, but allows for two or more levels of the categorical variable. This tests whether the mean of the dependent variable differs by the categorical variable.  We have an example data set called rb4wide , which is used in Kirk’s book Experimental Design.  In this data set, y is the dependent variable, a is the repeated measure and s is the variable that indicates the subject number. glm y1 y2 y3 y4 /wsfactor a(4). You will notice that this output gives four different p-values.  The output labeled “sphericity assumed”  is the p-value (0.000) that you would get if you assumed compound symmetry in the variance-covariance matrix.  Because that assumption is often not valid, the three other p-values offer various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound).  No matter which p-value you use, our results indicate that we have a statistically significant effect of a at the .05 level. See also SPSS Textbook Examples from Design and Analysis: Chapter 16 SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA SPSS Code Fragment: Repeated Measures ANOVA

Repeated measures logistic regression

If you have a binary outcome measured repeatedly for each subject and you wish to run a logistic regression that accounts for the effect of multiple measures from single subjects, you can perform a repeated measures logistic regression.  In SPSS, this can be done using the GENLIN command and indicating binomial as the probability distribution and logit as the link function to be used in the model. The exercise data file contains 3 pulse measurements from each of 30 people assigned to 2 different diet regiments and 3 different exercise regiments. If we define a “high” pulse as being over 100, we can then predict the probability of a high pulse using diet regiment. GET FILE='C:mydatahttps://stats.idre.ucla.edu/wp-content/uploads/2016/02/exercise.sav'. GENLIN highpulse (REFERENCE=LAST) BY diet (order = DESCENDING) /MODEL diet DISTRIBUTION=BINOMIAL LINK=LOGIT /REPEATED SUBJECT=id CORRTYPE = EXCHANGEABLE. These results indicate that diet is not statistically significant (Wald Chi-Square = 1.562, p = 0.211).

Factorial ANOVA

A factorial ANOVA has two or more categorical independent variables (either with or without the interactions) and a single normally distributed interval dependent variable.  For example, using the hsb2 data file we will look at writing scores ( write ) as the dependent variable and gender ( female ) and socio-economic status ( ses ) as independent variables, and we will include an interaction of female by ses .  Note that in SPSS, you do not need to have the interaction term(s) in your data set.  Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables that will make up the interaction term(s). glm write by female ses. These results indicate that the overall model is statistically significant (F = 5.666, p = 0.00).  The variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p = 0.002, respectively).  However, that interaction between female and ses is not statistically significant (F = 0.133, p = 0.875). See also SPSS Textbook Examples from Design and Analysis: Chapter 10 SPSS FAQ: How can I do tests of simple main effects in SPSS? SPSS FAQ: How do I plot ANOVA cell means in SPSS? SPSS Library: An Overview of SPSS GLM

Friedman test

You perform a Friedman test when you have one within-subjects independent variable with two or more levels and a dependent variable that is not interval and normally distributed (but at least ordinal).  We will use this test to determine if there is a difference in the reading, writing and math scores.  The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e., reading, writing and math) are the same.  To conduct a Friedman test, the data need to be in a long format.  SPSS handles this for you, but in other statistical packages you will have to reshape the data before you can conduct this test. npar tests /friedman = read write math. Friedman’s chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant.  Hence, there is no evidence that the distributions of the three types of scores are different.

Ordered logistic regression

Ordered logistic regression is used when the dependent variable is ordered, but not continuous.  For example, using the hsb2 data file we will create an ordered variable called write3 .  This variable will have the values 1, 2 and 3, indicating a low, medium or high writing score.  We do not generally recommend categorizing a continuous variable in this way; we are simply creating a variable to use for this example.  We will use gender ( female ), reading score ( read ) and social studies score ( socst ) as predictor variables in this model.  We will use a logit link and on the print subcommand we have requested the parameter estimates, the (model) summary statistics and the test of the parallel lines assumption. if write ge 30 and write le 48 write3 = 1. if write ge 49 and write le 57 write3 = 2. if write ge 58 and write le 70 write3 = 3. execute. plum write3 with female read socst /link = logit /print = parameter summary tparallel. The results indicate that the overall model is statistically significant (p < .000), as are each of the predictor variables (p < .000).  There are two thresholds for this model because there are three levels of the outcome variable.  We also see that the test of the proportional odds assumption is non-significant (p = .563).  One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same.  In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc.  This is called the proportional odds assumption or the parallel regression assumption.  Because the relationship between all pairs of groups is the same, there is only one set of coefficients (only one model).  If this was not the case, we would need different models (such as a generalized ordered logit model) to describe the relationship between each pair of outcome groups. See also SPSS Data Analysis Examples: Ordered logistic regression SPSS Annotated Output:  Ordinal Logistic Regression

Factorial logistic regression

A factorial logistic regression is used when you have two or more categorical independent variables but a dichotomous dependent variable.  For example, using the hsb2 data file we will use female as our dependent variable, because it is the only dichotomous variable in our data set; certainly not because it is common practice to use gender as an outcome variable.  We will use type of program ( prog ) and school type ( schtyp ) as our predictor variables.  Because prog is a categorical variable (it has three levels), we need to create dummy codes for it. SPSS will do this for you by making dummy codes for all variables listed after the keyword with .  SPSS will also create the interaction term; simply list the two variables that will make up the interaction separated by the keyword by . logistic regression female with prog schtyp prog by schtyp /contrast(prog) = indicator(1). The results indicate that the overall model is not statistically significant (LR chi2 = 3.147, p = 0.677).  Furthermore, none of the coefficients are statistically significant either.  This shows that the overall effect of prog is not significant. See also Annotated output for logistic regression

Correlation

A correlation is useful when you want to see the relationship between two (or more) normally distributed interval variables.  For example, using the hsb2 data file we can run a correlation between two continuous variables, read and write . correlations /variables = read write. In the second example, we will run a correlation between a dichotomous variable, female , and a continuous variable, write . Although it is assumed that the variables are interval and normally distributed, we can include dummy variables when performing correlations. correlations /variables = female write. In the first example above, we see that the correlation between read and write is 0.597.  By squaring the correlation and then multiplying by 100, you can determine what percentage of the variability is shared.  Let’s round 0.597 to be 0.6, which when squared would be .36, multiplied by 100 would be 36%.  Hence read shares about 36% of its variability with write .  In the output for the second example, we can see the correlation between write and female is 0.256.  Squaring this number yields .065536, meaning that female shares approximately 6.5% of its variability with write . See also Annotated output for correlation SPSS Learning Module: An Overview of Statistical Tests in SPSS SPSS FAQ: How can I analyze my data by categories? Missing Data in SPSS
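A quick Python equivalent of the shared-variability calculation, under the same assumption of a local hsb2.csv export, would be:

import pandas as pd

hsb2 = pd.read_csv("hsb2.csv")            # hypothetical local export of the hsb2 data
r = hsb2["read"].corr(hsb2["write"])      # Pearson correlation between read and write
print(r, r**2 * 100)                      # correlation and % of shared variability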

Simple linear regression

Simple linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable.  For example, using the hsb2 data file , say we wish to look at the relationship between writing scores ( write ) and reading scores ( read ); in other words, predicting write from read . regression variables = write read /dependent = write /method = enter. We see that the relationship between write and read is positive (.552) and based on the t-value (10.47) and p-value (0.000), we would conclude this relationship is statistically significant.  Hence, we would say there is a statistically significant positive linear relationship between reading and writing. See also Regression With SPSS: Chapter 1 – Simple and Multiple Regression Annotated output for regression SPSS Textbook Examples: Introduction to the Practice of Statistics, Chapter 10 SPSS Textbook Examples: Regression with Graphics, Chapter 2 SPSS Textbook Examples: Applied Regression Analysis, Chapter 5
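A comparable model can be fitted in Python with the statsmodels formula interface, again assuming a local hsb2.csv export (the file name is an assumption):

import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")                       # hypothetical local export
model = smf.ols("write ~ read", data=hsb2).fit()     # predict write from read
print(model.summary())                               # slope, t-value, and p-value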

Non-parametric correlation

A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted in ranks and then correlated.  In our example, we will look for a relationship between read and write .  We will not assume that both of these variables are normal and interval. nonpar corr /variables = read write /print = spearman. The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is statistically significant.

Simple logistic regression

Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1).  We have only one variable in the hsb2 data file that is coded 0 and 1, and that is female .  We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output.  The first variable listed after the logistic command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables.  In our example, female will be the outcome variable, and read will be the predictor variable.  As with OLS regression, the predictor variables must be either dichotomous or continuous; they cannot be categorical. logistic regression female with read. The results indicate that reading score ( read ) is not a statistically significant predictor of gender (i.e., being female), Wald = .562, p = 0.453. Likewise, the test of the overall model is not statistically significant, LR chi-squared – 0.56, p = 0.453. See also Annotated output for logistic regression SPSS Library: What kind of contrasts are these?
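A rough Python counterpart of this simple logistic regression, using the statsmodels formula interface and the same hypothetical hsb2.csv export, is sketched below.

import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")                         # hypothetical local export
model = smf.logit("female ~ read", data=hsb2).fit()    # female must be coded 0/1
print(model.summary())                                 # Wald test for read and overall model fit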

Multiple regression

Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor variable in the equation.  For example, using the hsb2 data file we will predict writing score from gender ( female ), reading, math, science and social studies ( socst ) scores. regression variable = write female read math science socst /dependent = write /method = enter. The results indicate that the overall model is statistically significant (F = 58.60, p = 0.000).  Furthermore, all of the predictor variables are statistically significant except for read . See also Regression with SPSS: Chapter 1 – Simple and Multiple Regression Annotated output for regression SPSS Frequently Asked Questions SPSS Textbook Examples: Regression with Graphics, Chapter 3 SPSS Textbook Examples: Applied Regression Analysis
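In Python, the same multiple regression could be sketched with statsmodels as shown below, using the hypothetical hsb2.csv export and the variable names from the hsb2 file.

import pandas as pd
import statsmodels.formula.api as smf

hsb2 = pd.read_csv("hsb2.csv")   # hypothetical local export
model = smf.ols("write ~ female + read + math + science + socst", data=hsb2).fit()
print(model.summary())           # overall F test plus a coefficient and p-value per predictor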

Analysis of covariance

Analysis of covariance is like ANOVA, except in addition to the categorical predictors you also have continuous predictors as well.  For example, the one way ANOVA example used write as the dependent variable and prog as the independent variable.  Let’s add read as a continuous variable to this model, as shown below. glm write with read by prog. The results indicate that even after adjusting for reading score ( read ), writing scores still significantly differ by program type ( prog ), F = 5.867, p = 0.003. See also SPSS Textbook Examples from Design and Analysis: Chapter 14 SPSS Library: An Overview of SPSS GLM SPSS Library: How do I handle interactions of continuous and categorical variables?

Multiple logistic regression

Multiple logistic regression is like simple logistic regression, except that there are two or more predictors.  The predictors can be interval variables or dummy variables, but cannot be categorical variables.  If you have categorical predictors, they should be coded into one or more dummy variables. We have only one variable in our data set that is coded 0 and 1, and that is female .  We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output.  The first variable listed after the logistic regression command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables (listed after the keyword with ).  In our example, female will be the outcome variable, and read and write will be the predictor variables. logistic regression female with read write. These results show that both read and write are significant predictors of female . See also Annotated output for logistic regression SPSS Textbook Examples: Applied Logistic Regression, Chapter 2 SPSS Code Fragments: Graphing Results in Logistic Regression

Discriminant analysis

Discriminant analysis is used when you have one or more normally distributed interval independent variables and a categorical dependent variable.  It is a multivariate technique that considers the latent dimensions in the independent variables for predicting group membership in the categorical dependent variable.  For example, using the hsb2 data file , say we wish to use read , write and math scores to predict the type of program a student belongs to ( prog ). discriminate groups = prog(1, 3) /variables = read write math. Clearly, the SPSS output for this procedure is quite lengthy, and it is beyond the scope of this page to explain all of it.  However, the main point is that two canonical variables are identified by the analysis, the first of which seems to be more related to program type than the second. See also discriminant function analysis SPSS Library: A History of SPSS Statistical Features

One-way MANOVA

MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. For example, using the hsb2 data file , say we wish to examine the differences in read , write and math broken down by program type ( prog ). glm read write math by prog. The students in the different programs differ in their joint distribution of read , write and math . See also SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA GLM: MANOVA and MANCOVA SPSS Library: MANOVA and GLM

Multivariate multiple regression

Multivariate multiple regression is used when you have two or more dependent variables that are to be predicted from two or more independent variables.  In our example using the hsb2 data file , we will predict write and read from female , math , science and social studies ( socst ) scores. glm write read with female math science socst. These results show that all of  the variables in the model have a statistically significant relationship with the joint distribution of write and read .

Canonical correlation

Canonical correlation is a multivariate technique used to examine the relationship between two groups of variables.  For each set of variables, it creates latent variables and looks at the relationships among the latent variables. It assumes that all variables in the model are interval and normally distributed.  SPSS requires that each of the two groups of variables be separated by the keyword with .  There need not be an equal number of variables in the two groups (before and after the with ). manova read write with math science /discrim.

[SPSS MANOVA output omitted: multivariate tests of significance, univariate F-tests, raw and standardized canonical coefficients for the dependent variables and covariates, and within-cells regression estimates for read and write.]

The output shows the linear combinations corresponding to the first canonical correlation.  At the bottom of the output are the two canonical correlations. These results indicate that the first canonical correlation is .7728.  The F-test in this output tests the hypothesis that the first canonical correlation is equal to zero.  Clearly, F = 56.4706 is statistically significant.  However, the second canonical correlation of .0235 is not statistically significantly different from zero (F = 0.1087, p = 0.7420).

Factor analysis

Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the number of variables in a model or to detect relationships among variables.  All variables involved in the factor analysis need to be interval and are assumed to be normally distributed.  The goal of the analysis is to try to identify factors which underlie the variables.  There may be fewer factors than variables, but there may not be more factors than variables.  For our example using the hsb2 data file , let’s suppose that we think that there are some common factors underlying the various test scores.  We will include subcommands for varimax rotation and a plot of the eigenvalues.  We will use a principal components extraction and will retain two factors. (Using these options will make our results compatible with those from SAS and Stata and are not necessarily the options that you will want to use.) factor /variables read write math science socst /criteria factors(2) /extraction pc /rotation varimax /plot eigen. Communality (which is the opposite of uniqueness) is the proportion of variance of the variable (i.e., read ) that is accounted for by all of the factors taken together, and a very low communality can indicate that a variable may not belong with any of the factors.  The scree plot may be useful in determining how many factors to retain.  From the component matrix table, we can see that all five of the test scores load onto the first factor, while all five tend to load not so heavily on the second factor.  The purpose of rotating the factors is to get the variables to load either very high or very low on each factor.  In this example, because all of the variables loaded onto factor 1 and not on factor 2, the rotation did not aid in the interpretation. Instead, it made the results even more difficult to interpret. See also SPSS FAQ: What does Cronbach’s alpha mean?

What Is Data Analysis? (With Examples)

Data analysis is the practice of working with data to glean useful information, which can then be used to make informed decisions.

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts," Sherlock Holme's proclaims in Sir Arthur Conan Doyle's A Scandal in Bohemia.

This idea lies at the root of data analysis. When we can extract meaning from data, it empowers us to make better decisions. And we’re living in a time when we have more data than ever at our fingertips.

Companies are wising up to the benefits of leveraging data. Data analysis can help a bank to personalize customer interactions, a health care system to predict future health needs, or an entertainment company to create the next big streaming hit.

The World Economic Forum Future of Jobs Report 2023 listed data analysts and scientists as one of the most in-demand jobs, alongside AI and machine learning specialists and big data specialists [ 1 ]. In this article, you'll learn more about the data analysis process, different types of data analysis, and recommended courses to help you get started in this exciting field.

Read more: How to Become a Data Analyst (with or Without a Degree)

Beginner-friendly data analysis courses

Interested in building your knowledge of data analysis today? Consider enrolling in one of these popular courses on Coursera:

In Google's Foundations: Data, Data, Everywhere course, you'll explore key data analysis concepts, tools, and jobs.

In Duke University's Data Analysis and Visualization course, you'll learn how to identify key components for data analytics projects, explore data visualization, and find out how to create a compelling data story.

Data analysis process

As the data available to companies continues to grow both in amount and complexity, so too does the need for an effective and efficient process by which to harness the value of that data. The data analysis process typically moves through several iterative phases. Let’s take a closer look at each.

Identify the business question you’d like to answer. What problem is the company trying to solve? What do you need to measure, and how will you measure it? 

Collect the raw data sets you’ll need to help you answer the identified question. Data collection might come from internal sources, like a company’s client relationship management (CRM) software, or from secondary sources, like government records or social media application programming interfaces (APIs). 

Clean the data to prepare it for analysis. This often involves purging duplicate and anomalous data, reconciling inconsistencies, standardizing data structure and format, and dealing with white spaces and other syntax errors.

Analyze the data. By manipulating the data using various data analysis techniques and tools, you can begin to find trends, correlations, outliers, and variations that tell a story. During this stage, you might use data mining to discover patterns within databases or data visualization software to help transform data into an easy-to-understand graphical format.

Interpret the results of your analysis to see how well the data answered your original question. What recommendations can you make based on the data? What are the limitations to your conclusions? 
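As a small, hedged illustration of the cleaning and analysis steps above, the Python sketch below uses pandas on a hypothetical sales export; the file and column names are invented for the example.

import pandas as pd

# Hypothetical raw export (file and column names are invented)
df = pd.read_csv("monthly_sales_raw.csv")

# Clean: drop duplicate rows, trim stray whitespace, standardise a text column
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.title()

# Analyse: summary statistics and average sales by region
print(df["sales"].describe())
print(df.groupby("region")["sales"].mean().sort_values(ascending=False))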

You can complete hands-on projects for your portfolio while practicing statistical analysis, data management, and programming with Meta's beginner-friendly Data Analyst Professional Certificate . Designed to prepare you for an entry-level role, this self-paced program can be completed in just 5 months.

Or, learn more about data analysis in this lecture by Kevin, Director of Data Analytics at Google, from Google's Data Analytics Professional Certificate.

Read more: What Does a Data Analyst Do? A Career Guide

Types of data analysis (with examples)

Data can be used to answer questions and support decisions in many different ways. To identify the best way to analyze your data, it can help to familiarize yourself with the four types of data analysis commonly used in the field.

In this section, we’ll take a look at each of these data analysis methods, along with an example of how each might be applied in the real world.

Descriptive analysis

Descriptive analysis tells us what happened. This type of analysis helps describe or summarize quantitative data by presenting statistics. For example, descriptive statistical analysis could show the distribution of sales across a group of employees and the average sales figure per employee. 

Descriptive analysis answers the question, “what happened?”

Diagnostic analysis

If the descriptive analysis determines the “what,” diagnostic analysis determines the “why.” Let’s say a descriptive analysis shows an unusual influx of patients in a hospital. Drilling into the data further might reveal that many of these patients shared symptoms of a particular virus. This diagnostic analysis can help you determine that an infectious agent—the “why”—led to the influx of patients.

Diagnostic analysis answers the question, “why did it happen?”

Predictive analysis

So far, we’ve looked at types of analysis that examine and draw conclusions about the past. Predictive analytics uses data to form projections about the future. Using predictive analysis, you might notice that a given product has had its best sales during the months of September and October each year, leading you to predict a similar high point during the upcoming year.

Predictive analysis answers the question, “what might happen in the future?”

Prescriptive analysis

Prescriptive analysis takes all the insights gathered from the first three types of analysis and uses them to form recommendations for how a company should act. Using our previous example, this type of analysis might suggest a marketing plan to build on the success of the high sales months and harness new growth opportunities in the slower months.

Prescriptive analysis answers the question, “what should we do about it?”

This last type is where the concept of data-driven decision-making comes into play.

Read more : Advanced Analytics: Definition, Benefits, and Use Cases

What is data-driven decision-making (DDDM)?

Data-driven decision-making (sometimes abbreviated to DDDM) can be defined as the process of making strategic business decisions based on facts, data, and metrics instead of intuition, emotion, or observation.

This might sound obvious, but in practice, not all organizations are as data-driven as they could be. According to global management consulting firm McKinsey Global Institute, data-driven companies are better at acquiring new customers, maintaining customer loyalty, and achieving above-average profitability [ 2 ].

Get started with Coursera

If you’re interested in a career in the high-growth field of data analytics, consider these top-rated courses on Coursera:

Begin building job-ready skills with the Google Data Analytics Professional Certificate. Prepare for an entry-level job as you learn from Google employees, with no experience or degree required.

Practice working with data with Macquarie University's Excel Skills for Business Specialization. Learn how to use Microsoft Excel to analyze data and make data-informed business decisions.

Deepen your skill set with Google's Advanced Data Analytics Professional Certificate. In this advanced program, you'll continue exploring the concepts introduced in the beginner-level courses, plus learn Python, statistics, and machine learning concepts.

Frequently asked questions (FAQ)

Where is data analytics used?

Just about any business or organization can use data analytics to help inform its decisions and boost its performance. Some of the most successful companies across a range of industries, from Amazon and Netflix to Starbucks and General Electric, integrate data into their business plans to improve their overall business performance.

What are the top skills for a data analyst?

Data analysis makes use of a range of analysis tools and technologies. Some of the top skills for data analysts include SQL, data visualization, statistical programming languages (like R and Python), machine learning, and spreadsheets.

Read: 7 In-Demand Data Analyst Skills to Get Hired in 2022

What is a data analyst job salary?

Data from Glassdoor indicates that the average base salary for a data analyst in the United States is $75,349 as of March 2024 [3]. How much you make will depend on factors like your qualifications, experience, and location.

Do data analysts need to be good at math?

Data analytics tends to be less math-intensive than data science. While you probably won’t need to master any advanced mathematics, a foundation in basic math and statistical analysis can help set you up for success.

Learn more: Data Analyst vs. Data Scientist: What’s the Difference?

Article sources

1. World Economic Forum. "The Future of Jobs Report 2023, https://www3.weforum.org/docs/WEF_Future_of_Jobs_2023.pdf." Accessed March 19, 2024.

2. McKinsey & Company. "Five facts: How customer analytics boosts corporate performance, https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance." Accessed March 19, 2024.

3. Glassdoor. "Data Analyst Salaries, https://www.glassdoor.com/Salaries/data-analyst-salary-SRCH_KO0,12.htm." Accessed March 19, 2024.


