
Quant Analysis 101: Descriptive Statistics

Everything You Need To Get Started (With Examples)

By: Derek Jansen (MBA) | Reviewers: Kerryn Warren (PhD) | October 2023

If you’re new to quantitative data analysis, one of the first terms you’re likely to hear being thrown around is descriptive statistics. In this post, we’ll unpack the basics of descriptive statistics, using straightforward language and loads of examples. So grab a cup of coffee and let’s crunch some numbers!

Overview: Descriptive Statistics

  • What are descriptive statistics?
  • Descriptive vs inferential statistics
  • Why the descriptives matter
  • The “Big 7” descriptive statistics
  • Key takeaways

At the simplest level, descriptive statistics summarise and describe relatively basic but essential features of a quantitative dataset – for example, a set of survey responses. They provide a snapshot of the characteristics of your dataset and allow you to better understand, roughly, how the data are “shaped” (more on this later). For example, a descriptive statistic could include the proportion of males and females within a sample or the percentages of different age groups within a population.

Another common descriptive statistic is the humble average (which in statistics-talk is called the mean). For example, if you undertook a survey and asked people to rate their satisfaction with a particular product on a scale of 1 to 10, you could then calculate the average rating. This is a very basic statistic, but as you can see, it gives you some idea of how the data are “shaped”.

Descriptive statistics summarise and describe relatively basic but essential features of a quantitative dataset, including its “shape”

What about inferential statistics?

Now, you may have also heard the term inferential statistics being thrown around, and you’re probably wondering how that’s different from descriptive statistics. Simply put, descriptive statistics describe and summarise the sample itself, while inferential statistics use the data from a sample to make inferences or predictions about a population.

Put another way, descriptive statistics help you understand your dataset, while inferential statistics help you make broader statements about the population, based on what you observe within the sample. If you’re keen to learn more, we cover inferential stats in another post.

Why do descriptive statistics matter?

While descriptive statistics are relatively simple from a mathematical perspective, they play a very important role in any research project. All too often, students skim over the descriptives and run ahead to the seemingly more exciting inferential statistics, but this can be a costly mistake.

The reason for this is that descriptive statistics help you, as the researcher, comprehend the key characteristics of your sample without getting lost in vast amounts of raw data. In doing so, they provide a foundation for your quantitative analysis. Additionally, they enable you to quickly identify potential issues within your dataset – for example, suspicious outliers, missing responses and so on. Just as importantly, descriptive statistics inform the decision-making process when it comes to choosing which inferential statistics you’ll run, as each inferential test has specific requirements regarding the shape of the data.

Long story short, it’s essential that you take the time to dig into your descriptive statistics before looking at more “advanced” inferentials. It’s also worth noting that, depending on your research aims and questions, descriptive stats may be all that you need in any case. So, don’t discount the descriptives!


The “Big 7” descriptive statistics

With the what and why out of the way, let’s take a look at the most common descriptive statistics. Beyond the counts, proportions and percentages we mentioned earlier, we have what we call the “Big 7” descriptives. These can be divided into two categories – measures of central tendency and measures of dispersion.

Measures of central tendency

True to the name, measures of central tendency describe the centre or “middle section” of a dataset. In other words, they provide some indication of what a “typical” data point looks like within a given dataset. The three most common measures are:

The mean, which is the mathematical average of a set of numbers – in other words, the sum of all numbers divided by the count of all numbers.
The median, which is the middlemost number in a set of numbers, when those numbers are ordered from lowest to highest.
The mode, which is the most frequently occurring number in a set of numbers (in any order). Naturally, a dataset can have one mode, no mode (no number occurs more than once) or multiple modes.
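These three definitions translate almost word-for-word into code. Here is a minimal pure-Python sketch; the ratings list is illustrative, not the actual survey data from the example that follows:

```python
ratings = [2, 5, 5, 5, 6, 6, 7, 8, 9]  # hypothetical ratings on a 1-10 scale

# Mean: the sum of all numbers divided by the count of all numbers
mean = sum(ratings) / len(ratings)

# Median: the middlemost number once sorted (average the two middle
# values if the set has an even number of entries)
ordered = sorted(ratings)
mid = len(ordered) // 2
median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequently occurring number
mode = max(set(ratings), key=ratings.count)

print(mean, median, mode)  # ~5.89, 6, 5
```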

To make this a little more tangible, let’s look at a sample dataset, along with the corresponding mean, median and mode. This dataset reflects the service ratings (on a scale of 1 – 10) from 15 customers.

Example set of descriptive stats

As you can see, the mean of 5.8 is the average rating across all 15 customers. Meanwhile, 6 is the median. In other words, if you were to list all the responses in order from low to high, Customer 8 would be in the middle (with their service rating being 6). Lastly, the number 5 is the most frequent rating (appearing 3 times), making it the mode.

Together, these three descriptive statistics give us a quick overview of how these customers feel about the service levels at this business. In other words, most customers feel rather lukewarm and there’s certainly room for improvement. From a more statistical perspective, this also means that the data tend to cluster around the 5-6 mark, since the mean and the median are fairly close to each other.

To take this a step further, let’s look at the frequency distribution of the responses. In other words, let’s count how many times each rating was received, and then plot these counts onto a bar chart.
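The counting itself is a one-liner in most languages. Here’s a quick sketch using Python’s collections.Counter (again with illustrative data), including a crude text version of the bar chart:

```python
from collections import Counter

ratings = [2, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8]  # illustrative survey responses

freq = Counter(ratings)
for rating in sorted(freq):
    print(f"{rating:2d} | {'#' * freq[rating]}")  # e.g.  5 | ###
```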

Example frequency distribution of descriptive stats

As you can see, the responses tend to cluster toward the centre of the chart, creating something of a bell-shaped curve. In statistical terms, this is called a normal distribution.

As you delve into quantitative data analysis, you’ll find that normal distributions are very common, but they’re certainly not the only type of distribution. In some cases, the data can lean toward the left or the right of the chart (i.e., toward the low end or high end). This lean is reflected by a measure called skewness, and it’s important to pay attention to this when you’re analysing your data, as it will have an impact on which inferential statistics you can use on your dataset.

Example of skewness
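Skewness can be computed in a few slightly different ways; a common convention is the moment-based Fisher-Pearson coefficient, sketched below (this is also the default formula in scipy.stats.skew). A negative result means the longer tail points toward the low end:

```python
import statistics

def skewness(data):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Ratings piled up at the high end -> negative (left) skew
print(skewness([5, 6, 6, 7, 7, 7, 8, 8]))  # about -0.31
```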

Measures of dispersion

While the measures of central tendency provide insight into how “centred” the dataset is, it’s also important to understand how dispersed that dataset is – in other words, to what extent the data are spread out around the centre (specifically, the mean). In some cases, the majority of the data points will sit very close to the centre, while in other cases, they’ll be scattered all over the place. Enter the measures of dispersion, of which there are three:

Range, which measures the difference between the largest and smallest number in the dataset. In other words, it indicates how spread out the dataset really is.

Variance, which measures how much each number in a dataset varies from the mean (average). More technically, it calculates the average of the squared differences between each number and the mean. A higher variance indicates that the data points are more spread out, while a lower variance suggests that the data points are closer to the mean.

Standard deviation, which is the square root of the variance. It serves the same purpose as the variance, but is a bit easier to interpret as it presents a figure that is in the same unit as the original data. You’ll typically present this statistic alongside the means when describing the data in your research.
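As a quick sketch, all three measures are available in Python’s standard library. Note that pvariance and pstdev are the population versions; statistics.variance and statistics.stdev give the sample versions, which divide by n – 1. The ratings list here is illustrative:

```python
import statistics

ratings = [2, 4, 5, 5, 5, 6, 7, 8, 10]  # illustrative ratings

data_range = max(ratings) - min(ratings)  # largest minus smallest
variance = statistics.pvariance(ratings)  # average squared deviation from the mean
std_dev = statistics.pstdev(ratings)      # square root of the variance

print(data_range, variance, round(std_dev, 2))
```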

Again, let’s look at our sample dataset to make this all a little more tangible.

Example dispersion statistics for the sample dataset

As you can see, the range of 8 reflects the difference between the highest rating (10) and the lowest rating (2). The standard deviation of 2.18 tells us that, on average, ratings sit about 2.18 points away from the mean (of 5.8), reflecting a relatively dispersed set of data.

For the sake of comparison, let’s look at another much more tightly grouped (less dispersed) dataset.

Example of skewed data

As you can see, all the ratings lie between 5 and 8 in this dataset, resulting in a much smaller range, variance and standard deviation. You might also notice that the data are clustered toward the right side of the graph – in other words, the data are skewed. If we calculate the skewness for this dataset, we get a result of -0.12; the negative value confirms this lean toward the higher end (in statistical terms, a slightly left-skewed, or negatively skewed, distribution, since the longer tail points toward the low end).

In summary, range, variance and standard deviation all provide an indication of how dispersed the data are. These measures are important because they help you interpret the measures of central tendency within context. In other words, if your measures of dispersion are all fairly high numbers, you need to interpret your measures of central tendency with some caution, as the results are not particularly centred. Conversely, if the data are all tightly grouped around the mean (i.e., low dispersion), the mean becomes a much more “meaningful” statistic.

Key Takeaways

We’ve covered quite a bit of ground in this post. Here are the key takeaways:

  • Descriptive statistics, although relatively simple, are a critically important part of any quantitative data analysis.
  • Measures of central tendency include the mean (average), median and mode.
  • Skewness indicates whether a dataset leans to one side or another.
  • Measures of dispersion include the range, variance and standard deviation.

If you’d like hands-on help with your descriptive statistics (or any other aspect of your research project), check out our private coaching service, where we hold your hand through each step of the research journey.


Descriptive Statistics | Definitions, Types, Examples

Published on July 9, 2020 by Pritha Bhandari. Revised on June 21, 2023.

Descriptive statistics summarize and organize characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or refutes your hypothesis and whether it is generalizable to a larger population.

Table of contents

  • Types of descriptive statistics
  • Frequency distribution
  • Measures of central tendency
  • Measures of variability
  • Univariate descriptive statistics
  • Bivariate descriptive statistics
  • Frequently asked questions about descriptive statistics

There are 3 main types of descriptive statistics:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability or dispersion concerns how spread out the values are.

Types of descriptive statistics

You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis.

For example, suppose you survey participants about how many times in the past year they did each of the following:

  • Go to a library
  • Watch a movie at a theater
  • Visit a national park



A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or percentages. This is called a frequency distribution.

Simple frequency distribution table:
Gender Number
Male 182
Female 235
Other 27

From this table, you can see that more women than men or people with another gender identity took part in the study. In a grouped frequency distribution, you can group numerical response values and add up the number of responses for each group. You can also convert each of these numbers to percentages.

Grouped frequency distribution table:

Library visits in the past year Percent
0–4 6%
5–8 20%
9–12 42%
13–16 24%
17+ 8%

Measures of central tendency estimate the center, or average, of a data set. The mean, median and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of our survey.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.

Mean number of library visits
Data set 15, 3, 12, 0, 24, 3
Sum of all values 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses N = 6
Mean Divide the sum of values by N to find M: 57/6 = 9.5

The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then, the median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits
Ordered data set 0, 3, 3, 12, 15, 24
Middle numbers 3, 12
Median Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

Mode number of library visits
Ordered data set 0, 3, 3, 12, 15, 24
Mode Find the most frequently occurring response: 3
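All three results can be checked in one go with Python’s statistics module:

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]  # the library-visit responses from above

print(statistics.mean(visits))    # 9.5
print(statistics.median(visits))  # 7.5 (mean of the two middle values)
print(statistics.mode(visits))    # 3
```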

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range, simply subtract the lowest value from the highest value.

Standard deviation

The standard deviation (s or SD) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  1. List each score and find their mean.
  2. Subtract the mean from each score to get the deviation from the mean.
  3. Square each of these deviations.
  4. Add up all of the squared deviations.
  5. Divide the sum of the squared deviations by N – 1.
  6. Find the square root of the number you found.
Raw data Deviation from mean Squared deviation
15 15 – 9.5 = 5.5 30.25
3 3 – 9.5 = -6.5 42.25
12 12 – 9.5 = 2.5 6.25
0 0 – 9.5 = -9.5 90.25
24 24 – 9.5 = 14.5 210.25
3 3 – 9.5 = -6.5 42.25
M = 9.5 Sum = 0 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18
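The six steps translate line-for-line into code. This sketch reproduces the worked example above:

```python
data = [15, 3, 12, 0, 24, 3]

n = len(data)
mean = sum(data) / n                        # Step 1: 9.5
deviations = [x - mean for x in data]       # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]      # Step 3: square each deviation
sum_of_squares = sum(squared)               # Step 4: 421.5
sample_variance = sum_of_squares / (n - 1)  # Step 5: 421.5 / 5 = 84.3
std_dev = sample_variance ** 0.5            # Step 6: √84.3

print(round(std_dev, 2))  # 9.18
```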

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².


Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

Visits to the library
N 6
Mean 9.5
Median 7.5
Mode 3
Standard deviation 9.18
Variance 84.3
Range 24

If you were to consider only the mean as a measure of central tendency, your impression of the “middle” of the data set could be skewed by outliers, unlike with the median or mode.

Likewise, while the range is sensitive to outliers, you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests .

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read “across” the table to see how the independent and dependent variables relate to each other.

Number of visits to the library in the past year
Group 0–4 5–8 9–12 13–16 17+
Children 32 68 37 23 22
Adults 36 48 43 83 25

Interpreting a contingency table is easier when the raw data is converted to percentages. Percentages make each row comparable to the other by making it seem as if each group had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable on the end.

Visits to the library in the past year (Percentages)
Group 0–4 5–8 9–12 13–16 17+
Children 18% 37% 20% 13% 12% 182
Adults 15% 20% 18% 35% 11% 235

From this table, it is clearer that similar proportions of children and adults go to the library over 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults, this number was between 13 and 16.
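The conversion from raw counts to row percentages is simple enough to script. This sketch reproduces the percentage table above from the raw contingency counts:

```python
counts = {
    "Children": [32, 68, 37, 23, 22],
    "Adults":   [36, 48, 43, 83, 25],
}
bins = ["0-4", "5-8", "9-12", "13-16", "17+"]

for group, row in counts.items():
    n = sum(row)  # N per group: 182 children, 235 adults
    pcts = {b: round(100 * c / n) for b, c in zip(bins, row)}
    print(group, pcts, "N =", n)
```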

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables. It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each data point is represented by a point in the chart.
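As a sketch (assuming matplotlib is installed; the numbers are made up for illustration), a scatter plot takes only a few lines:

```python
import matplotlib.pyplot as plt

movies = [0, 2, 4, 6, 8, 10, 12]     # hypothetical movie-theater visits
library = [22, 18, 15, 11, 8, 5, 3]  # hypothetical library visits

plt.scatter(movies, library)
plt.xlabel("Movies seen at a theater (past year)")
plt.ylabel("Library visits (past year)")
plt.show()
```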

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.

Descriptive statistics: Scatter plot


Frequently asked questions about descriptive statistics

Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarize only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .



Descriptive Statistics – Types, Methods and Examples


Descriptive statistics is a branch of statistics that deals with the summarization and description of collected data. This type of statistics is used to simplify and present data in a manner that is easy to understand, often through visual or numerical methods. Descriptive statistics is primarily concerned with measures of central tendency, variability, and distribution, as well as graphical representations of data.

Here are the main components of descriptive statistics:

  • Measures of Central Tendency : These provide a summary statistic that represents the center point or typical value of a dataset. The most common measures of central tendency are the mean (average), median (middle value), and mode (most frequent value).
  • Measures of Dispersion or Variability : These provide a summary statistic that represents the spread of values in a dataset. Common measures of dispersion include the range (difference between the highest and lowest values), variance (average of the squared differences from the mean), standard deviation (square root of the variance), and interquartile range (difference between the upper and lower quartiles).
  • Measures of Position : These are used to understand the distribution of values within a dataset. They include percentiles and quartiles.
  • Graphical Representations : Data can be visually represented using various methods like bar graphs, histograms, pie charts, box plots, and scatter plots. These visuals provide a clear, intuitive way to understand the data.
  • Measures of Association : These measures provide insight into the relationships between variables in the dataset, such as correlation and covariance.

Descriptive Statistics Types

Descriptive statistics can be classified into two types:

Measures of Central Tendency

These measures help describe the center point or average of a data set. There are three main types:

  • Mean : The average value of the dataset, obtained by adding all the data points and dividing by the number of data points.
  • Median : The middle value of the dataset, obtained by ordering all data points and picking out the one in the middle (or the average of the two middle numbers if the dataset has an even number of observations).
  • Mode : The most frequently occurring value in the dataset.

Measures of Variability (or Dispersion)

These measures describe the spread or variability of the data points in the dataset. There are four main types:

  • Range : The difference between the largest and smallest values in the dataset.
  • Variance : The average of the squared differences from the mean.
  • Standard Deviation : The square root of the variance, giving a measure of dispersion that is in the same units as the original dataset.
  • Interquartile Range (IQR) : The range between the first quartile (25th percentile) and the third quartile (75th percentile), which provides a measure of variability that is resistant to outliers.

Descriptive Statistics Formulas

Here are some of the most commonly used formulas in descriptive statistics:

Mean (μ or x̄) :

The average of all the numbers in the dataset. It is computed by summing all the observations and dividing by the number of observations.

Formula : μ = Σx/N or x̄ = Σx/n (where Σx is the sum of all observations, N is the population size, and n is the sample size)

Median :

The middle value in the dataset when the observations are arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle numbers.

Mode :

The most frequently occurring number in the dataset. There’s no formula for this as it’s determined by observation.

Range :

The difference between the highest (max) and lowest (min) values in the dataset.

Formula : Range = max – min

Variance (σ² or s²) :

The average of the squared differences from the mean. Variance is a measure of how spread out the numbers in the dataset are.

Population Variance formula : σ² = Σ(x – μ)² / N
Sample Variance formula : s² = Σ(x – x̄)² / (n – 1)

(where x is each individual observation, μ is the population mean, x̄ is the sample mean, N is the size of the population, and n is the size of the sample)

Standard Deviation (σ or s) :

The square root of the variance. It measures the amount of variability or dispersion for a set of data.

Population Standard Deviation formula : σ = √σ²
Sample Standard Deviation formula : s = √s²

Interquartile Range (IQR) :

The range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). It measures statistical dispersion, or how far apart the data points are.

Formula : IQR = Q3 – Q1
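Here is a short sketch implementing these formulas directly. One caveat: quartile conventions differ slightly between textbooks and libraries, so IQR values can vary a little depending on the method (Python’s statistics.quantiles defaults to the "exclusive" method):

```python
import statistics

data = [5, 19, 24, 62, 91, 100]

mean = sum(data) / len(data)                                     # x̄ = Σx/n
pop_var = sum((x - mean) ** 2 for x in data) / len(data)         # σ² = Σ(x – μ)² / N
samp_var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)  # s² = Σ(x – x̄)² / (n – 1)

q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                # IQR = Q3 – Q1

print(mean, pop_var ** 0.5, samp_var ** 0.5, iqr)  # mean, σ, s, IQR
```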

Descriptive Statistics Methods

Here are some of the key methods used in descriptive statistics:

Tabulation

This method involves arranging data into a table format, making it easier to understand and interpret. Tables often show the frequency distribution of variables.

Graphical Representation

This method involves presenting data visually to help reveal patterns, trends, outliers, or relationships between variables. There are many types of graphs used, such as bar graphs, histograms, pie charts, line graphs, box plots, and scatter plots.

Calculation of Central Tendency Measures

This involves determining the mean, median, and mode of a dataset. These measures indicate where the center of the dataset lies.

Calculation of Dispersion Measures

This involves calculating the range, variance, standard deviation, and interquartile range. These measures indicate how spread out the data is.

Calculation of Position Measures

This involves determining percentiles and quartiles, which tell us about the position of particular data points within the overall data distribution.

Calculation of Association Measures

This involves calculating statistics like correlation and covariance to understand relationships between variables.

Summary Statistics

Often, a collection of several descriptive statistics is presented together in what’s known as a “summary statistics” table. This provides a comprehensive snapshot of the data at a glance.

Descriptive Statistics Examples

Descriptive Statistics Examples are as follows:

Example 1: Student Grades

Let’s say a teacher has the following set of grades for 7 students: 85, 90, 88, 92, 78, 88, and 94. The teacher could use descriptive statistics to summarize this data:

  • Mean (average) : (85 + 90 + 88 + 92 + 78 + 88 + 94)/7 = 615/7 ≈ 87.86
  • Median (middle value) : First, rearrange the grades in ascending order (78, 85, 88, 88, 90, 92, 94). The median grade is 88.
  • Mode (most frequent value) : The grade 88 appears twice, more frequently than any other grade, so it’s the mode.
  • Range (difference between highest and lowest) : 94 (highest) – 78 (lowest) = 16
  • Variance and Standard Deviation : These would be calculated using the appropriate formulas, providing a measure of the dispersion of the grades.
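The whole summary can be verified in a few lines with Python’s statistics module:

```python
import statistics

grades = [85, 90, 88, 92, 78, 88, 94]

print(round(statistics.mean(grades), 2))   # 87.86
print(statistics.median(grades))           # 88
print(statistics.mode(grades))             # 88
print(max(grades) - min(grades))           # 16 (range)
print(round(statistics.stdev(grades), 2))  # sample standard deviation
```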

Example 2: Survey Data

A researcher conducts a survey on the number of hours of TV watched per day by people in a particular city. They collect data from 1,000 respondents and can use descriptive statistics to summarize this data:

  • Mean : Calculate the average hours of TV watched by adding all the responses and dividing by the total number of respondents.
  • Median : Sort the data and find the middle value.
  • Mode : Identify the most frequently reported number of hours watched.
  • Histogram : Create a histogram to visually display the frequency of responses. This could show, for example, that the majority of people watch 2-3 hours of TV per day.
  • Standard Deviation : Calculate this to find out how much variation there is from the average.

Importance of Descriptive Statistics

Descriptive statistics are fundamental in the field of data analysis and interpretation, as they provide the first step in understanding a dataset. Here are a few reasons why descriptive statistics are important:

  • Data Summarization : Descriptive statistics provide simple summaries about the measures and samples you have collected. With a large dataset, it’s often difficult to identify patterns or tendencies just by looking at the raw data. Descriptive statistics provide numerical and graphical summaries that can highlight important aspects of the data.
  • Data Simplification : They simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary, making it easier to understand and interpret the dataset.
  • Identification of Patterns and Trends : Descriptive statistics can help identify patterns and trends in the data, providing valuable insights. Measures like the mean and median can tell you about the central tendency of your data, while measures like the range and standard deviation tell you about the dispersion.
  • Data Comparison : By summarizing data into measures such as the mean and standard deviation, it’s easier to compare different datasets or different groups within a dataset.
  • Data Quality Assessment : Descriptive statistics can help identify errors or outliers in the data, which might indicate issues with data collection or entry.
  • Foundation for Further Analysis : Descriptive statistics are typically the first step in data analysis. They help create a foundation for further statistical or inferential analysis. In fact, advanced statistical techniques often assume that one has first examined their data using descriptive methods.

When to use Descriptive Statistics

They can be used in a wide range of situations, including:

  • Understanding a New Dataset : When you first encounter a new dataset, using descriptive statistics is a useful first step to understand the main characteristics of the data, such as the central tendency, dispersion, and distribution.
  • Data Exploration in Research : In the initial stages of a research project, descriptive statistics can help to explore the data, identify trends and patterns, and generate hypotheses for further testing.
  • Presenting Research Findings : Descriptive statistics can be used to present research findings in a clear and understandable way, often using visual aids like graphs or charts.
  • Monitoring and Quality Control : In fields like business or manufacturing, descriptive statistics are often used to monitor processes, track performance over time, and identify any deviations from expected standards.
  • Comparing Groups : Descriptive statistics can be used to compare different groups or categories within your data. For example, you might want to compare the average scores of two groups of students, or the variance in sales between different regions.
  • Reporting Survey Results : If you conduct a survey, you would use descriptive statistics to summarize the responses, such as calculating the percentage of respondents who agree with a certain statement.

Applications of Descriptive Statistics

Descriptive statistics are widely used in a variety of fields to summarize, represent, and analyze data. Here are some applications:

  • Business : Businesses use descriptive statistics to summarize and interpret data such as sales figures, customer feedback, or employee performance. For instance, they might calculate the mean sales for each month to understand trends, or use graphical representations like bar charts to present sales data.
  • Healthcare : In healthcare, descriptive statistics are used to summarize patient data, such as age, weight, blood pressure, or cholesterol levels. They are also used to describe the incidence and prevalence of diseases in a population.
  • Education : Educators use descriptive statistics to summarize student performance, like average test scores or grade distribution. This information can help identify areas where students are struggling and inform instructional decisions.
  • Social Sciences : Social scientists use descriptive statistics to summarize data collected from surveys, experiments, and observational studies. This can involve describing demographic characteristics of participants, response frequencies to survey items, and more.
  • Psychology : Psychologists use descriptive statistics to describe the characteristics of their study participants and the main findings of their research, such as the average score on a psychological test.
  • Sports : Sports analysts use descriptive statistics to summarize athlete and team performance, such as batting averages in baseball or points per game in basketball.
  • Government : Government agencies use descriptive statistics to summarize data about the population, such as census data on population size and demographics.
  • Finance and Economics : In finance, descriptive statistics can be used to summarize past investment performance or economic data, such as changes in stock prices or GDP growth rates.
  • Quality Control : In manufacturing, descriptive statistics can be used to summarize measures of product quality, such as the average dimensions of a product or the frequency of defects.

Limitations of Descriptive Statistics

While descriptive statistics are a crucial part of data analysis and provide valuable insights about a dataset, they do have certain limitations:

  • Lack of Depth : Descriptive statistics provide a summary of your data, but they can oversimplify the data, resulting in a loss of detail and potentially significant nuances.
  • Vulnerability to Outliers : Some descriptive measures, like the mean, are sensitive to outliers. A single extreme value can significantly skew your mean, making it less representative of your data.
  • Inability to Make Predictions : Descriptive statistics describe what has been observed in a dataset. They don’t allow you to make predictions or generalizations about unobserved data or larger populations.
  • No Insight into Correlations : While some descriptive statistics can hint at potential relationships between variables, they don’t provide detailed insights into the nature or strength of these relationships.
  • No Causality or Hypothesis Testing : Descriptive statistics cannot be used to determine cause and effect relationships or to test hypotheses. For these purposes, inferential statistics are needed.
  • Can Mislead : When used improperly, descriptive statistics can be used to present a misleading picture of the data. For instance, choosing to only report the mean without also reporting the standard deviation or range can hide a large amount of variability in the data.




Descriptive Statistics: Definition, Overview, Types, and Examples

By Adam Hayes, Ph.D., CFA


Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean , median , and mode , while measures of variability include standard deviation , variance , minimum and maximum variables, kurtosis , and skewness .

Key Takeaways

  • Descriptive statistics summarizes or describes the characteristics of a data set.
  • Descriptive statistics consists of three basic categories of measures: measures of central tendency, measures of variability (or spread), and frequency distribution.
  • Measures of central tendency describe the center of the data set (mean, median, mode).
  • Measures of variability describe the dispersion of the data set (variance, standard deviation).
  • Measures of frequency distribution describe the occurrence of data within the data set (count).


Understanding Descriptive Statistics

Descriptive statistics help describe and explain the features of a specific data set by giving short summaries about the sample and measures of the data. The most recognized types of descriptive statistics are measures of center. For example, the mean, median, and mode, which are used at almost all levels of math and statistics, are used to define and describe a data set. The mean, or the average, is calculated by adding all the figures within the data set and then dividing by the number of figures within the set.

For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most often, and the median is the figure situated in the middle of the data set. It is the figure separating the higher figures from the lower figures within a data set. However, there are less common types of descriptive statistics that are still very important.

People use descriptive statistics to repurpose hard-to-understand quantitative insights across a large data set into bite-sized descriptions. A student's grade point average (GPA), for example, provides a good understanding of descriptive statistics. The idea of a GPA is that it takes data points from a range of individual course grades, and averages them together to provide a general understanding of a student's overall academic performance. A student's personal GPA reflects their mean academic performance.

Descriptive statistics, especially in fields such as medicine, often visually depict data using scatter plots, histograms, line graphs, or stem and leaf displays. We'll talk more about visuals later in this article.

Types of Descriptive Statistics

All descriptive statistics are either measures of central tendency or measures of variability, also known as measures of dispersion.

Central Tendency

Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two measures use graphs, tables, and general discussions to help people understand the meaning of the analyzed data.

Measures of central tendency describe the center position of a distribution for a data set. A person analyzes the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measures the most common patterns of the analyzed data set.

Measures of Variability

Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while a measure of central tendency may give a person the average of a data set, it does not describe how the data are distributed within the set.

So while the average of the data might be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute deviation, and variance are all examples of measures of variability.

Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).

Distribution

Distribution (or frequency distribution) refers to the number of times a data point occurs. Alternatively, it can be how many times a data point fails to occur. Consider this data set: male, male, female, female, female, other. The distribution of this data can be classified as:

  • The number of males in the data set is 2.
  • The number of females in the data set is 3.
  • The number of individuals identifying as other is 1.
  • The number of non-males is 4.

In descriptive statistics, univariate data analyzes only one variable. It is used to identify characteristics of a single trait and is not used to analyze any relationships or causations.

For example, imagine a room full of high school students. Say you wanted to gather the average age of the individuals in the room. This univariate data is only dependent on one factor: each person's age. By gathering this one piece of information from each person and dividing by the total number of people, you can determine the average age.

Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two types of data are collected, and the relationship between the two pieces of information is analyzed together. When more than two variables are involved, the analysis is referred to as multivariate.

Let's say each high school student in the example above takes a college assessment test, and we want to see whether older students are testing better than younger students. In addition to gathering the ages of the students, we need to find out each student's test score. Then, using data analytics, we mathematically or graphically depict whether there is a relationship between student age and test scores.

The preparation and reporting of financial statements is an example of descriptive statistics. Analyzing that financial information to make decisions on the future is inferential statistics.

One essential aspect of descriptive statistics is graphical representation. Visualizing data distributions effectively can be incredibly powerful, and this is done in several ways.

Histograms are tools for displaying the distribution of numerical data. They divide the data into bins or intervals and represent the frequency or count of data points falling into each bin through bars of varying heights. Histograms help identify the shape of the distribution, central tendency, and variability of the data.

Another visualization is boxplots. Boxplots, also known as box-and-whisker plots, provide a concise summary of a data distribution by highlighting key summary statistics, including the median (the middle line inside the box), the quartiles (the edges of the box), and potential outliers (points plotted beyond the “whiskers”). Boxplots visually depict the spread and skewness of the data and are particularly useful for comparing distributions across different groups or variables.

Whenever descriptive statistics are being discussed, it's important to note outliers. Outliers are data points that significantly differ from other observations in a dataset. These could be errors, anomalies, or rare events within the data.

Detecting and managing outliers is a step in descriptive statistics to ensure accurate and reliable data analysis. To identify outliers, you can use graphical techniques (such as boxplots or scatter plots) or statistical methods (such as Z-score or IQR method). These approaches help pinpoint observations that deviate substantially from the overall pattern of the data.
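Here is a sketch of the two standard screens mentioned above; note that the 1.5 × IQR fence and the z-score cutoff (commonly 2.5 to 3) are conventions, not hard rules:

```python
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 45]  # 45 looks suspicious

# IQR method: flag points beyond 1.5 * IQR outside the quartiles
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Z-score method: flag points far from the mean in standard-deviation units
mean, sd = statistics.fmean(data), statistics.stdev(data)
z_outliers = [x for x in data if abs(x - mean) / sd > 2.5]

print(iqr_outliers, z_outliers)  # both flag 45
```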

The presence of outliers can have a notable impact on descriptive statistics, skewing results and affecting the interpretation of data. Outliers can disproportionately influence measures of central tendency, such as the mean, pulling it towards their extreme values. For example, the mean of the dataset (1, 1, 1, 997) is 250, even though that is hardly representative of the dataset. This distortion can lead to misleading conclusions about the typical behavior of the dataset.

Depending on the context, outliers can often be treated by removing them (if they are genuinely erroneous or irrelevant). Alternatively, outliers may hold important information and should be kept for the value they may be able to demonstrate. As you analyze your data, consider the relevance of what outliers can contribute and whether it makes more sense to just strike those data points from your descriptive statistic calculations.

Descriptive Statistics vs. Inferential Statistics

Descriptive statistics have a different function from inferential statistics, which are data sets that are used to make decisions or apply characteristics from one data set to another.

Imagine another example where a company sells hot sauce. The company gathers data such as the count of sales, average quantity purchased per transaction, and average sale per day of the week. All of this information is descriptive, as it tells a story of what actually happened in the past. In this case, it is not being used beyond being informational.

Now let's say that the company wants to roll out a new hot sauce. It gathers the same sales data above, but it uses the information to make predictions about what the sales of the new hot sauce will be. The act of using descriptive statistics and applying characteristics to a different data set makes the data set inferential statistics. We are no longer simply summarizing data; we are using it to predict what will happen regarding an entirely different body of data (in this case, the new hot sauce product).

What Is Descriptive Statistics?

Descriptive statistics is a means of describing features of a data set by generating summaries about data samples. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city.

What Are Examples of Descriptive Statistics?

In recapping a Major League Baseball season, for example, descriptive statistics might include team batting averages, the number of runs allowed per team, and the average wins per division.

What Is the Main Purpose of Descriptive Statistics?

The main purpose of descriptive statistics is to provide information about a data set. In the example above, there are dozens of baseball teams, hundreds of players, and thousands of games. Descriptive statistics summarizes large amounts of data into useful bits of information.

What Are the Types of Descriptive Statistics?

The three main types of descriptive statistics are frequency distribution, central tendency, and variability of a data set. The frequency distribution records how often data occurs, central tendency records the data's center point of distribution, and variability of a data set records its degree of dispersion.

Can Descriptive Statistics Be Used to Make Inferences or Predictions?

Technically speaking, descriptive statistics only serves to help understand historical data attributes. Inferential statistics—a separate branch of statistics—is used to understand how variables interact with one another in a data set and possibly predict what might happen in the future.

Descriptive statistics refers to the analysis, summary, and communication of findings that describe a data set. Often not useful for decision-making, descriptive statistics still hold value in explaining high-level summaries of a set of information such as the mean, median, mode, variance, range, and count of information.


What is Descriptive Statistics? Definition, Types, Examples

Appinio Research · 23.11.2023 · 39 min read


Have you ever wondered how we make sense of the vast sea of data surrounding us? In a world overflowing with information, the ability to distill complex datasets into meaningful insights is a skill of immense importance.

This guide will equip you with the knowledge and tools to unravel the stories hidden within data. Whether you're a data analyst, a researcher, a business professional, or simply curious about the art of data interpretation, this guide will demystify the fundamental concepts and techniques of descriptive statistics, empowering you to explore, understand, and communicate data like a seasoned expert.

What is Descriptive Statistics?

Descriptive statistics  refers to a set of mathematical and graphical tools used to summarize and describe essential features of a dataset. These statistics provide a clear and concise representation of data, enabling researchers, analysts, and decision-makers to gain valuable insights, identify patterns, and understand the characteristics of the information at hand.

Purpose of Descriptive Statistics

The primary purpose of descriptive statistics is to simplify and condense complex data into manageable, interpretable summaries. Descriptive statistics serve several key objectives:

  • Data Summarization:  They provide a compact summary of the main characteristics of a dataset, allowing individuals to grasp the essential features quickly.
  • Data Visualization:  Descriptive statistics often accompany visual representations, such as histograms, box plots, and bar charts, making it easier to interpret and communicate data trends and distributions.
  • Data Exploration:  They facilitate the exploration of data to identify outliers, patterns, and potential areas of interest or concern.
  • Data Comparison:  Descriptive statistics enable the comparison of datasets, groups, or variables, aiding in decision-making and hypothesis testing.
  • Informed Decision-Making:  By providing a clear understanding of data, descriptive statistics support informed decision-making across various domains, including business, healthcare, social sciences, and more.

Importance of Descriptive Statistics in Data Analysis

Descriptive statistics play a pivotal role in data analysis by providing a foundation for understanding, summarizing, and interpreting data. Their importance is underscored by their widespread use in diverse fields and industries.

Here are key reasons why descriptive statistics are crucial in data analysis:

  • Data Simplification:  Descriptive statistics simplify complex datasets, making them more accessible to analysts and decision-makers. They condense extensive information into concise metrics and visual representations.
  • Initial Data Assessment:  Descriptive statistics are often the first step in data analysis. They help analysts gain a preliminary understanding of the data's characteristics and identify potential areas for further investigation.
  • Data Visualization:  Descriptive statistics are often paired with visualizations, enhancing data interpretation. Visual representations, such as histograms and scatter plots, provide intuitive insights into data patterns.
  • Communication and Reporting:  Descriptive statistics serve as a common language for conveying data insights to a broader audience. They are instrumental in research reports, presentations, and data-driven decision-making.
  • Quality Control:  In manufacturing and quality control processes, descriptive statistics help monitor and maintain product quality by identifying deviations from desired standards.
  • Risk Assessment:  In finance and insurance, descriptive statistics, such as standard deviation and variance, are used to assess and manage risk associated with investments and policies.
  • Healthcare Decision-Making:  Descriptive statistics inform healthcare professionals about patient demographics, treatment outcomes, and disease prevalence, aiding in clinical decision-making and healthcare policy formulation.
  • Market Analysis:  In marketing and consumer research, descriptive statistics reveal customer preferences, market trends, and product performance, guiding marketing strategies and product development.
  • Scientific Research:  In scientific research, descriptive statistics are fundamental for summarizing experimental results, comparing groups, and identifying meaningful patterns in data.
  • Government and Policy:  Government agencies use descriptive statistics to collect and analyze data on demographics, economics, and social trends to inform policy decisions and resource allocation.

Descriptive statistics serve as a critical foundation for effective data analysis and decision-making across a wide range of disciplines. They empower individuals and organizations to extract meaningful insights from data, enabling more informed and evidence-based choices.

Data Collection and Preparation

First, let's delve deeper into the crucial initial data collection and preparation steps. These initial stages lay the foundation for effective descriptive statistics.

Data Sources

When embarking on a data analysis journey, you must first identify your data sources. These sources can be categorized into two main types:

  • Primary Data :  This data is collected directly from original sources. It includes surveys, experiments, and observations tailored to your specific research objectives. Primary data offers high relevance and control over the data collection process.
  • Secondary Data :  Secondary data, on the other hand, is data that already exists and has been collected by someone else for a different purpose. It can include publicly available datasets, reports, and databases. Secondary data can save time and resources but may not always align perfectly with your research needs.

Understanding the nature of your data is fundamental. Data can be classified into two primary types:

  • Quantitative Data :  Quantitative data consists of numeric values and is often used for measurements and calculations. Examples include age, income, temperature, and test scores. Quantitative data can further be categorized as discrete (countable) or continuous (measurable).
  • Qualitative Data :  Qualitative data, also known as categorical data, represents categories or labels and cannot be measured numerically. Examples include gender, color, and product categories. Qualitative data can be nominal (categories with no specific order) or ordinal (categories with a meaningful order).

Data Cleaning and Preprocessing

Once you have your data in hand, preparing it for analysis is essential. Data cleaning and preprocessing involve several critical steps:

Handling Missing Data

Missing data can significantly impact your analysis. There are various approaches to address missing values (a brief code sketch follows the list):

  • Deletion:  You can remove rows or columns with missing data, but this may lead to a loss of valuable information.
  • Imputation:  Imputing missing values involves estimating or filling in the missing data using methods such as mean imputation, median imputation, or advanced techniques like regression imputation.
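
As a rough illustration of both approaches, here is a minimal pandas sketch. The DataFrame and column names are invented for this example, and regression imputation is omitted because it requires an extra modelling step:

```python
import numpy as np
import pandas as pd

# Invented dataset for the sketch; two columns, each with one missing value
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [48000, 52000, 61000, np.nan, 45000],
})

# Deletion: drop every row that contains at least one missing value
df_deleted = df.dropna()

# Imputation: replace missing values with each column's mean or median
df_mean_imputed = df.fillna(df.mean(numeric_only=True))
df_median_imputed = df.fillna(df.median(numeric_only=True))
```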

Outlier Detection

Outliers are data points that deviate significantly from the rest of the data. Detecting and handling outliers is crucial to prevent them from skewing your results. Popular methods for identifying outliers include box plots and z-scores.
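
For instance, a z-score screen can be sketched in a few lines of NumPy. The values are invented, the cut-off is a judgment call, and in small samples the outlier itself inflates the standard deviation, so thresholds should be chosen with care:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10])  # 95 looks suspicious

# Standardize each value: how many standard deviations from the mean?
z_scores = (data - data.mean()) / data.std()

# Flag values beyond a chosen cut-off (2.5 here; 2 to 3 is a common range)
outliers = data[np.abs(z_scores) > 2.5]
print(outliers)  # [95]
```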

Data Transformation

Data transformation aims to normalize or standardize the data to make it more suitable for analysis. Common transformations, sketched in code after this list, include:

  • Normalization:  Scaling data to a standard range, often between 0 and 1.
  • Standardization:  Transforming data to have a mean of 0 and a standard deviation of 1.
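
Both transformations are one-liners in NumPy; the values below are invented for the sketch:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): rescale to the range [0, 1]
x_normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-scores): shift to mean 0, scale to standard deviation 1
x_standardized = (x - x.mean()) / x.std()
```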

Data Organization and Presentation

Organizing and presenting your data effectively is essential for meaningful analysis and communication. Here's how you can achieve this:

Data Tables

Data tables are a straightforward way to present your data, especially when dealing with smaller datasets. They allow you to list data in rows and columns, making it easy to review and perform basic calculations.

Graphs and Charts

Visualizations play a pivotal role in conveying the message hidden within your data. Some common types of graphs and charts include:

  • Histograms:  Histograms display the distribution of continuous data by dividing it into intervals or bins and showing the frequency of data points within each bin.
  • Bar Charts:  Bar charts are excellent for representing categorical or discrete data . They display categories on one axis and corresponding values on the other.
  • Line Charts:  Line charts are useful for identifying trends over time, making them suitable for time series data.
  • Scatter Plots:  Scatter plots help visualize the relationship between two variables, making them valuable for identifying correlations.
  • Pie Charts:  Pie charts are suitable for displaying the composition of a whole in terms of its parts, often as percentages.

Summary Statistics

Calculating summary statistics, such as the mean, median, and standard deviation, provides a quick snapshot of your data's central tendencies and variability.
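
As a quick pandas sketch (the ratings below are invented), a handful of method calls produces this snapshot:

```python
import pandas as pd

ratings = pd.Series([4, 7, 6, 8, 5, 7, 9, 6, 7, 5])  # e.g. satisfaction scores

print(ratings.mean())      # central tendency: the average rating
print(ratings.median())    # central tendency: the middle rating
print(ratings.std())       # variability: sample standard deviation
print(ratings.describe())  # count, mean, std, min, quartiles and max in one call
```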

When it comes to data collection and visualization, Appinio offers a seamless solution that simplifies the process. In Appinio, creating interactive visualizations is the easiest way to understand and present your data effectively. These visuals help you uncover insights and patterns within your data, making it a valuable tool for anyone seeking to make data-driven decisions.


Measures of Central Tendency

Measures of central tendency are statistics that provide insight into the central or typical value of a dataset. They help you understand where the data tends to cluster, which is crucial for drawing meaningful conclusions.

Mean

The mean, also known as the average, is the most widely used measure of central tendency. It is calculated by summing all the values in a dataset and then dividing by the total number of values. The formula for the mean (μ) is:

μ = (Σx) / N
  • μ represents the mean.
  • Σx represents the sum of all individual data points.
  • N is the total number of data points.

Because the mean is highly sensitive to outliers and extreme values, it is most appropriate for roughly symmetric, normally distributed data.

Median

The median is another measure of central tendency that is less influenced by outliers compared to the mean. To find the median, you first arrange the data in ascending or descending order and then locate the middle value. If there's an even number of data points, the median is the average of the two middle values.

For example, in the dataset [3, 5, 7, 8, 10], the median is 7.

Mode

The mode is the value that appears most frequently in a dataset. Unlike the mean and median, which are computed from the numeric values themselves, the mode simply identifies the data point with the highest frequency of occurrence.

In the dataset [3, 5, 7, 8, 8], the mode is 8.
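
Using the datasets from the examples above, Python's built-in statistics module returns all three measures directly:

```python
import statistics

data = [3, 5, 7, 8, 8]

print(statistics.mean(data))    # 6.2
print(statistics.median(data))  # 7
print(statistics.mode(data))    # 8
```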

Choosing the Right Measure

Selecting the appropriate measure of central tendency depends on the nature of your data and your research objectives:

  • Use the  mean  for normally distributed data without significant outliers.
  • Choose the  median  when dealing with skewed data or data with outliers.
  • The  mode  is most useful for categorical or nominal data.

Understanding these measures and when to apply them is crucial for accurate data analysis and interpretation.

Measures of Variability

The measures of variability provide insights into how spread out or dispersed your data is. These measures complement the central tendency measures discussed earlier and are essential for a comprehensive understanding of your dataset.

The range is the simplest measure of variability and is calculated as the difference between the maximum and minimum values in your dataset. It offers a quick assessment of the spread of your data.

Range = Maximum Value - Minimum Value

For example, consider a dataset of daily temperatures in Celsius for a month:

  • Maximum temperature: 30°C
  • Minimum temperature: 10°C

The range would be 30°C - 10°C = 20°C, indicating a 20-degree Celsius spread in temperature over the month.

Variance measures the average squared deviation of each data point from the mean. It quantifies the overall dispersion of data points. The formula for variance (σ²) is as follows:

σ² = Σ(x - μ)² / N
  • σ² represents the variance.
  • Σ represents the summation symbol.
  • x represents each individual data point.
  • μ is the mean of the dataset.
  • N is the total number of data points.

Calculating the variance involves the following:

  • Find the mean (μ) of the dataset.
  • For each data point, subtract the mean (x - μ).
  • Square the result for each data point [(x - μ)²].
  • Sum up all the squared differences [(Σ(x - μ)²)].
  • Divide by the total number of data points (N) to get the variance.

A higher variance indicates greater variability among data points, while a lower variance suggests data points are closer to the mean.

Standard Deviation

The standard deviation is a widely used measure of variability and is simply the square root of the variance. It provides a more interpretable value and is often preferred for reporting. The formula for standard deviation (σ) is:

σ = √(σ²) = √[Σ(x - μ)² / N]

Calculating the standard deviation follows the same process as variance but with an additional step of taking the square root of the variance. It represents the average deviation of data points from the mean in the same units as the data.

For example, if the variance is calculated as 16 (square units), the standard deviation would be 4 (the same units as the data). A smaller standard deviation indicates data points are closer to the mean, while a larger standard deviation indicates greater variability.
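
Following the steps listed above, a short Python sketch (with an invented dataset) computes the population variance and then takes the square root to get the standard deviation:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                             # step 1: find the mean
squared_diffs = [(x - mean) ** 2 for x in data]  # steps 2-3: (x − μ)²
variance = sum(squared_diffs) / n                # steps 4-5: population variance
std_dev = math.sqrt(variance)                    # square root gives σ

print(variance, std_dev)  # 4.0 2.0
```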

Interquartile Range (IQR)

The interquartile range (IQR) is a robust measure of variability that is less influenced by extreme values (outliers) than the range, variance, or standard deviation. It is based on the quartiles of the dataset. To calculate the IQR:

  • Arrange the data in ascending order.
  • Calculate the first quartile (Q1), which is the median of the lower half of the data.
  • Calculate the third quartile (Q3), which is the median of the upper half of the data.
  • Subtract Q1 from Q3 to find the IQR.

IQR = Q3 - Q1

The IQR represents the range within which the central 50% of your data falls. It provides valuable information about the middle spread of your dataset, making it a useful measure for skewed or non-normally distributed data.
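
As a sketch, the quartiles and IQR can be computed with NumPy. Note that NumPy's default interpolation can give slightly different quartiles than the median-of-halves method described above, since several quartile conventions exist:

```python
import numpy as np

data = np.array([3, 5, 7, 8, 10, 12, 14, 18, 21])  # invented, already sorted

q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1
print(q1, q3, iqr)  # 7.0 14.0 7.0
```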

Data Distribution

Understanding the distribution of your data is essential for making meaningful inferences and choosing appropriate statistical methods. In this section, we will explore different aspects of data distribution.

Normal Distribution

The normal distribution, also known as the Gaussian distribution or bell curve, is a fundamental concept in statistics. It is characterized by a symmetric, bell-shaped curve. In a normal distribution:

  • The mean, median, and mode are all equal and located at the center of the distribution.
  • Data points are spread symmetrically around the mean, with most observations concentrated near it.
  • The distribution is defined by two parameters: mean (μ) and standard deviation (σ).

The normal distribution is essential in various statistical tests and modeling techniques. Many natural phenomena, such as heights and IQ scores, closely follow a normal distribution. It serves as a reference point for understanding other distributions and statistical analyses.

Skewness and Kurtosis

Skewness and kurtosis are measures that provide insights into the shape of a data distribution:

Skewness quantifies the asymmetry of a distribution. A distribution can be:

  • Positively Skewed (Right-skewed):  In a positively skewed distribution, the tail extends to the right, and the majority of data points are concentrated on the left side of the distribution. The mean is typically greater than the median.
  • Negatively Skewed (Left-skewed):  In a negatively skewed distribution, the tail extends to the left, and the majority of data points are concentrated on the right side of the distribution. The mean is typically less than the median.

Skewness is calculated using various formulas, including Pearson's first coefficient of skewness, defined as (mean − mode) / standard deviation.

Kurtosis measures the "tailedness" of a distribution, indicating whether the distribution has heavy or light tails compared to a normal distribution. Kurtosis can be:

  • Leptokurtic:  A distribution with positive kurtosis has heavier tails and a more peaked central region than a normal distribution.
  • Mesokurtic:  A distribution with kurtosis equal to that of a normal distribution.
  • Platykurtic:  A distribution with negative kurtosis has lighter tails and a flatter central region than a normal distribution.

Kurtosis is calculated using different formulas, including the fourth standardized moment.

Understanding skewness and kurtosis helps you assess the departure of your data from normality and choose appropriate statistical methods.
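
To quantify both measures in practice, SciPy provides sample estimates of skewness and (excess) kurtosis; the right-skewed exponential sample below is invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)  # a right-skewed sample

print(stats.skew(sample))      # positive for right-skewed data
print(stats.kurtosis(sample))  # excess kurtosis: 0 for a normal distribution
```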

Other Types of Distributions

While the normal distribution is prevalent, real-world data often follows different distributions. Some other types of distributions you may encounter include:

  • Exponential Distribution:  Commonly used for modeling the time between events in a Poisson process, such as arrival times in a queue.
  • Poisson Distribution:  Used for counting the number of events in a fixed interval of time or space, such as the number of phone calls received in an hour.
  • Binomial Distribution:  Suitable for modeling the number of successes in a fixed number of independent Bernoulli trials.
  • Lognormal Distribution:  Often used for data that is the product of many small, independent, positive factors, such as stock prices.
  • Uniform Distribution:  Represents a constant probability over a specified range of values, making all outcomes equally likely.

Understanding the characteristics and properties of these distributions is crucial for selecting appropriate statistical techniques and making accurate interpretations in various fields of study and data analysis.
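
As an illustration, each of these distributions can be sampled with NumPy's random generator; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

exponential = rng.exponential(scale=5.0, size=1000)  # e.g. minutes between arrivals
poisson = rng.poisson(lam=3.0, size=1000)            # e.g. calls per hour
binomial = rng.binomial(n=10, p=0.5, size=1000)      # successes in 10 trials
lognormal = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
uniform = rng.uniform(low=0.0, high=1.0, size=1000)  # all outcomes equally likely
```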

Visualizing Data

Visualizing data is a powerful way to gain insights and understand the patterns and characteristics of your dataset. Below are several standard methods of data visualization.

Histograms

Histograms are a widely used graphical representation of the distribution of continuous data. They are particularly useful for understanding the shape of the data's frequency distribution. Here's how they work:

  • Data is divided into intervals, or "bins."
  • The number of data points falling into each bin is represented by the height of bars on a graph.
  • The bars are typically adjacent and do not have gaps between them.

Histograms help you visualize the central tendency, spread, and skewness of your data. They can reveal whether your data is normally distributed, skewed to the left or right, or exhibits multiple peaks.

Histograms are especially useful when you have a large dataset and want to quickly assess its distribution. They are commonly used in fields like finance to analyze stock returns, biology to study species distribution, and quality control to monitor manufacturing processes.
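
A minimal Matplotlib sketch, using synthetic data invented for the example:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=500)  # synthetic, roughly bell-shaped

plt.hist(data, bins=20, edgecolor="black")  # 20 bins; adjacent bars, no gaps
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of a synthetic dataset")
plt.show()
```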

Box Plots

Box plots, also known as box-and-whisker plots, are excellent tools for visualizing the distribution of data, particularly for identifying outliers and comparing multiple datasets. Here's how they are constructed:

  • The box represents the interquartile range (IQR), with the lower edge of the box at the first quartile (Q1) and the upper edge at the third quartile (Q3).
  • A vertical line inside the box indicates the median (Q2).
  • Whiskers extend from the edges of the box to the most extreme data points that fall within a set range, commonly 1.5 × IQR beyond the quartiles.
  • Outliers, which are data points significantly outside the whiskers, are often shown as individual points.

Box plots provide a concise summary of data distribution, including central tendency and variability. They are beneficial when comparing data distribution across different categories or groups.

Box plots are commonly used in fields like healthcare to compare patient outcomes by treatment, in education to assess student performance across schools, and in market research to analyze customer ratings for different products.

Scatter Plots

Scatter plots  are a valuable tool for visualizing the relationship between two continuous variables. They are handy for identifying patterns, trends, and correlations in data. Here's how they work:

  • Each data point is represented as a point on the graph, with one variable on the x-axis and the other on the y-axis.
  • The resulting plot shows the dispersion and clustering of data points, allowing you to assess the strength and direction of the relationship.

Scatter plots help you determine whether there is a positive, negative, or no correlation between the variables. Additionally, they can reveal outliers and influential data points that may affect the relationship.

Scatter plots are commonly used in fields like economics to analyze the relationship between income and education, environmental science to study the correlation between temperature and plant growth, and marketing to understand the relationship between advertising spend and sales.
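
A minimal sketch with invented advertising-spend and sales figures; the Pearson correlation coefficient summarizes the strength and direction of the linear relationship:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
ad_spend = rng.uniform(1000, 10000, size=50)           # hypothetical variable
sales = 3.0 * ad_spend + rng.normal(0, 4000, size=50)  # roughly linear plus noise

r = np.corrcoef(ad_spend, sales)[0, 1]  # Pearson correlation coefficient

plt.scatter(ad_spend, sales)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title(f"Scatter plot (r = {r:.2f})")
plt.show()
```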

Frequency Distributions

Frequency distributions  are a tabular way to organize and display categorical or discrete data. They show the count or frequency of each category within a dataset. Here's how to create a frequency distribution:

  • Identify the distinct categories or values in your dataset.
  • Count the number of occurrences of each category.
  • Organize the results in a table, with categories in one column and their respective frequencies in another.

Frequency distributions help you understand the distribution of categorical data, identify dominant categories, and detect any rare or uncommon values. They are commonly used in fields like marketing to analyze customer demographics, in education to assess student grades, and in social sciences to study survey responses.
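
Following those steps in code, a frequency distribution for a small invented set of survey responses can be produced with pandas:

```python
import pandas as pd

grades = pd.Series(["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"])

print(grades.value_counts())                # count per category
print(grades.value_counts(normalize=True))  # relative frequency (proportions)
```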

Descriptive Statistics for Categorical Data

Categorical data requires its own set of descriptive statistics to gain insights into the distribution and characteristics of these non-numeric variables. There are various methods for describing categorical data.

Frequency Tables

Frequency tables summarize categorical data by displaying the count or frequency of each category. When a table crosses two or more categorical variables, it is known as a contingency table. Here's how they are created:

  • List the categories or values of the categorical variable(s) in rows or columns.
  • Count the occurrences of each category and record the frequencies.

Frequency tables are best used for summarizing and comparing categorical data across different groups or dimensions. They provide a straightforward way to understand data distribution and identify patterns or associations.

For example, in a survey about favorite ice cream flavors , a frequency table might show how many respondents prefer vanilla, chocolate, strawberry, and other flavors.
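
As a sketch with invented survey responses, pandas can produce both a one-way frequency table for a single variable and a two-way contingency table crossing two variables:

```python
import pandas as pd

survey = pd.DataFrame({
    "flavor": ["vanilla", "chocolate", "vanilla", "strawberry",
               "chocolate", "vanilla", "chocolate", "strawberry"],
    "age_group": ["<30", "<30", "30+", "30+", "30+", "<30", "<30", "30+"],
})

# One-way frequency table for a single categorical variable
print(survey["flavor"].value_counts())

# Two-way contingency table crossing flavor preference with age group
print(pd.crosstab(survey["flavor"], survey["age_group"]))
```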

Bar Charts

Bar charts are a common graphical representation of categorical data. They are similar to histograms but are used for displaying categorical variables. Here's how they work:

  • Categories are listed on one axis (usually the x-axis), while the corresponding frequencies or counts are shown on the other axis (usually the y-axis).
  • Bars are drawn for each category, with the height of each bar representing the frequency or count of that category.

Bar charts make it easy to compare the frequencies of different categories visually. They are especially helpful for presenting categorical data in a visually appealing and understandable way.

Bar charts are commonly used in fields like market research to display survey results, in social sciences to illustrate demographic information, and in business to show product sales by category.

Pie Charts

Pie charts are circular graphs that represent the distribution of categorical data as "slices of a pie." Here's how they are constructed:

  • Categories or values are represented as segments or slices of the pie, with each segment's size proportional to its frequency or count.

Pie charts are effective for showing the relative proportions of different categories within a dataset. They are instrumental when you want to emphasize the composition of a whole in terms of its parts.

Pie charts are commonly used in areas such as marketing to display market share, in finance to show budget allocations, and in demographics to illustrate the distribution of ethnic groups within a population.

These methods for visualizing and summarizing categorical data are essential for gaining insights into non-numeric variables and making informed decisions based on the distribution of categories within a dataset.

Descriptive Statistics Summary and Interpretation

Summarizing and interpreting descriptive statistics gives you the skills to extract meaningful insights from your data and apply them to real-world scenarios.

Summarizing Descriptive Statistics

Once you've collected and analyzed your data using descriptive statistics, the next step is to summarize the findings. This involves condensing the wealth of information into a few key points:

  • Central Tendency:  Summarize the central tendency of your data. If it's a numeric dataset, mention the mean, median, and mode as appropriate. For categorical data, highlight the most frequent categories.
  • Variability:  Describe the spread of the data using measures like range, variance, and standard deviation. Discuss whether the data is tightly clustered or widely dispersed.
  • Distribution:  Mention the shape of the data distribution. Is it normal, skewed, or bimodal? Use histograms or box plots to illustrate the distribution visually.
  • Outliers:  Identify any outliers and discuss their potential impact on the analysis. Consider whether outliers should be treated or investigated further.
  • Key Observations: Highlight any notable observations or patterns that emerged during your analysis. Are there clear trends or interesting findings in the data?

Interpreting Descriptive Statistics

Interpreting descriptive statistics involves making sense of the numbers and metrics you've calculated. It's about understanding what the data is telling you about the underlying phenomenon. Here are some steps to guide your interpretation:

  • Context Matters:  Always consider the context of your data. What does a specific value or pattern mean in the real-world context of your study? For example, a mean salary value may vary significantly depending on the industry.
  • Comparisons:  If you have multiple datasets or groups, compare their descriptive statistics. Are there meaningful differences or similarities between them? Statistical tests may be needed for formal comparisons.
  • Correlations:  If you've used scatter plots to visualize relationships, interpret the direction and strength of correlations. Are variables positively or negatively correlated, or is there no clear relationship?
  • Causation:  Be cautious about inferring causation from descriptive statistics alone. Correlation does not imply causation, so consider additional research or experimentation to establish causal relationships.
  • Consider Outliers:  If you have outliers, assess their impact on the overall interpretation. Do they represent genuine data points or measurement errors?

Descriptive Statistics Examples

To better understand how descriptive statistics are applied in real-world scenarios, let's explore a range of practical examples across various fields and industries. These examples illustrate how descriptive statistics provide valuable insights and inform decision-making processes.

Financial Analysis

Example:  Investment Portfolio Analysis

Description:  An investment analyst is tasked with evaluating the performance of a portfolio of stocks over the past year. They collect daily returns for each stock and want to provide a comprehensive summary of the portfolio's performance.

Use of Descriptive Statistics:

  • Central Tendency:  Calculate the portfolio's average daily return (mean) to assess its overall performance during the year.
  • Variability:  Compute the portfolio's standard deviation to measure the risk or volatility associated with the investment.
  • Distribution:  Create a histogram to visualize the distribution of daily returns, helping the analyst understand the nature of the portfolio's gains and losses.
  • Outliers:  Identify any outliers in daily returns that may require further investigation.

The resulting descriptive statistics will guide the analyst in making recommendations to investors, such as adjusting the portfolio composition to manage risk or improve returns.

Healthcare

Example: Hospital Patient Demographics

Description: A hospital administrator wants to understand the demographics of patients admitted to their facility over the past year. They have data on patient age, gender, and medical conditions.

Use of Descriptive Statistics:

  • Central Tendency:  Calculate the average age of patients to assess the typical age of admissions.
  • Variability:  Compute the standard deviation of patient ages to understand how age varies among patients.
  • Distribution:  Create bar charts or pie charts to visualize the gender distribution of patients and frequency tables to analyze the prevalence of different medical conditions.
  • Key Observations:  Identify any trends, such as seasonal variations in admissions or common medical conditions among specific age groups.

These descriptive statistics help the hospital administration allocate resources effectively, plan for future patient needs, and tailor healthcare services to the demographics of their patient population.

Marketing Research

Example:  Product Sales Analysis

Description: A marketing team wants to evaluate the sales performance of different products in their product line. They have monthly sales data for the past two years.

Use of Descriptive Statistics:

  • Central Tendency:  Calculate the mean monthly sales for each product to determine their average performance.
  • Variability:  Compute the standard deviation of monthly sales to identify products with the most variable sales.
  • Distribution:  Create box plots to visualize the sales distribution for each product, helping to understand the range and variability.
  • Comparisons:  Compare sales trends over the two years for each product to identify growth or decline patterns.

Descriptive statistics allow the marketing team to make informed decisions about product marketing strategies, inventory management, and product development.

Social Sciences

Example:  Survey Analysis on Happiness Levels

Description: A sociologist conducts a survey to assess the happiness levels of residents in different neighborhoods within a city. Respondents rate their happiness on a scale of 1 to 10.

Use of Descriptive Statistics:

  • Central Tendency:  Calculate the mean happiness score for each neighborhood to identify areas with higher or lower average happiness levels.
  • Variability:  Compute the standard deviation of happiness scores to understand the degree of variation within each neighborhood.
  • Distribution:  Create histograms to visualize the distribution of happiness scores, revealing whether happiness levels are normally distributed or skewed.
  • Comparisons:  Compare the happiness levels across neighborhoods to identify potential factors influencing happiness disparities.

Descriptive statistics help sociologists pinpoint areas that may require interventions to improve residents' overall well-being and identify potential research directions.

These examples demonstrate how descriptive statistics play a vital role in summarizing and interpreting data across diverse domains. By applying these statistical techniques, professionals can make data-driven decisions, identify trends and patterns, and gain valuable insights into various aspects of their work.

Common Descriptive Statistics Mistakes and Pitfalls

While descriptive statistics are valuable tools, they can be misused or misinterpreted if not handled carefully. Here are some common mistakes and pitfalls to avoid when working with descriptive statistics.

Misinterpretation of Descriptive Statistics

  • Assuming Causation:  One of the most common mistakes is inferring causation from correlation . Just because two variables are correlated does not mean that one causes the other. Always be cautious about drawing causal relationships from descriptive statistics alone.
  • Ignoring Context:  Failing to consider the context of the data can lead to misinterpretation. A descriptive statistic may seem significant, but it might not have practical relevance in the specific context of your study.
  • Neglecting Outliers:  Ignoring outliers or treating them as errors without investigation can lead to incomplete and inaccurate conclusions. Outliers may hold valuable information or reveal unusual phenomena.
  • Overlooking Distribution Assumptions:  When applying statistical tests or methods, it's important to check whether your data meets the assumptions of those techniques. For example, using methods designed for normally distributed data on skewed data can yield misleading results.

Data Reporting Errors

  • Inadequate Data Documentation:  Failing to provide clear documentation about data sources, collection methods, and preprocessing steps can make it challenging for others to replicate your analysis or verify your findings.
  • Mislabeling Variables:  Accurate labeling of variables and units is crucial. Mislabeling or using inconsistent units can lead to erroneous calculations and interpretations.
  • Failure to Report Measures of Uncertainty:  Descriptive statistics provide point estimates of central tendency and variability. It's crucial to report measures of uncertainty, such as confidence intervals or standard errors, to convey the range of possible values.

Avoiding Biases in Descriptive Statistics

  • Sampling Bias :  Ensure that your sample is representative of the population you intend to study. Sampling bias can occur when certain groups or characteristics are over- or underrepresented in the sample, leading to biased results.
  • Selection Bias:  Be cautious of selection bias, where specific data points are systematically included or excluded based on criteria that are unrelated to the research question. This can distort the analysis.
  • Confirmation Bias:  Avoid the tendency to seek, interpret, or remember information in a way that confirms preexisting beliefs or hypotheses. This bias can lead to selective attention and misinterpretation of data.
  • Reporting Bias:  Be transparent in reporting all relevant data, even if the results do not support your hypothesis or are inconclusive. Omitting such data can create a biased view of the overall picture.

Awareness of these common mistakes and pitfalls can help you conduct more robust and accurate analyses using descriptive statistics, leading to more reliable and meaningful conclusions in your research and decision-making processes.

Descriptive statistics are the essential building blocks of data analysis. They provide us with the means to summarize, visualize, and comprehend the often intricate world of data. By mastering these techniques, you have gained a valuable skill that can be applied across a multitude of fields and industries. From making informed business decisions to advancing scientific research, from understanding market trends to improving healthcare outcomes, descriptive statistics serve as our trusted guides in the realm of data.

You've learned how to calculate measures of central tendency, assess variability, explore data distributions, and employ powerful visualization tools. You've seen how descriptive statistics bring clarity to the chaos of data, revealing patterns and outliers, guiding your decisions, and enabling you to communicate insights effectively. As you continue to work with data, remember that descriptive statistics are your steadfast companions, ready to help you navigate the data landscape, extract valuable insights, and make informed choices based on evidence rather than guesswork.

How to Collect Descriptive Statistics in Minutes?

Introducing Appinio , the real-time market research platform that's revolutionizing how businesses harness consumer insights. Imagine conducting your own market research in minutes, with the power of descriptive statistics at your fingertips.

Here's why Appinio is your go-to choice for fast, data-driven decisions:

Instant Insights: From questions to insights in minutes. Appinio accelerates your decision-making process, delivering real-time results when you need them most.

User-Friendly: No need for a PhD in research. Appinio's intuitive platform ensures that anyone can seamlessly gather and analyze data, making market research accessible to all.

Global Reach: Define your target group from 1200+ characteristics and survey it in over 90 countries. With Appinio, you can tap into a diverse pool of respondents worldwide.


Descriptive Statistics: Definitions, Types, Examples

Introduction

The first step of any data-related process is the collection of data. Once we have collected the data, what do we do with it? Data can be sorted, analyzed, and used in various methods and formats, depending on the project's needs. While analyzing a dataset, we use statistical methods to arrive at a conclusion, and data-driven decision-making depends on how efficiently we use these methods. Two types of statistical methods are widely used in data analysis: descriptive and inferential. This article will focus on descriptive statistics: its types, calculations, and examples.

This article was published as a part of the  Data Science Blogathon .

Table of Contents

  • What is Descriptive Statistics?
  • Types of Statistics
  • What is Inferential Statistics?
  • Types of Descriptive Statistics
  • Descriptive Statistics Based on the Central Tendency of Data
  • Descriptive Statistics Based on the Dispersion of Data
  • Descriptive Statistics Based on the Shape of the Data
  • Univariate Data vs. Bivariate Data in Descriptive Statistics
  • What Are the 10 Commonly Used Descriptive Statistics?
  • Can Descriptive Statistics Be Used to Make Inferences or Predictions?
  • Frequently Asked Questions

What is Descriptive Statistics?

Descriptive statistics serves as the initial step in understanding and summarizing data. It involves organizing, visualizing, and summarizing raw data to create a coherent picture. The primary goal of descriptive statistics is to provide a clear and concise overview of the data's main features. This helps us identify patterns, trends, and characteristics within the data set without making broader inferences.

Key Aspects of Descriptive Statistics

  • Measures of Central Tendency: Descriptive statistics include calculating the mean, median, and mode, which offer insights into the center of the data distribution.
  • Measures of Dispersion: Variance, standard deviation, and range help us understand the spread or variability of the data.
  • Visualizations: Creating graphs, histograms, bar charts, and pie charts visually represent the data’s distribution and characteristics

Types of Statistics

When you delve into the world of statistics, you'll encounter two fundamental branches: descriptive statistics and inferential statistics. These two distinct approaches help us make sense of data and draw conclusions. Let's look at the differences between these two branches to shed light on their roles in statistical analysis.

Aspect           | Descriptive Statistics                | Inferential Statistics
Purpose          | Summarize and describe data           | Draw conclusions or predictions
Data Sample      | Analyzes the entire dataset           | Analyzes a sample of the data
Examples         | Mean, Median, Range, Variance         | Hypothesis testing, Regression
Scope            | Focuses on data characteristics       | Makes inferences about populations
Goal             | Provides insights and simplifies data | Generalizes findings to a larger population
Assumptions      | No assumptions about populations      | Requires assumptions about populations
Common Use Cases | Data visualization, data exploration  | Scientific research, hypothesis testing

What is Inferential Statistics?

Inferential statistics takes data analysis to the next level by drawing conclusions about populations based on a sample. It involves making predictions, generalizations, and hypotheses about a larger group using a smaller subset of data. Inferential statistics bridges the gap between our data and the conclusions we want to reach. This is particularly useful when obtaining data from an entire population is impractical or impossible.

Key Aspects of Inferential Statistics

  • Sampling Techniques: Inferential statistics relies on carefully selecting representative samples from a population to make valid inferences.
  • Hypothesis Testing: This process involves setting up hypotheses about population characteristics and using sample data to determine if these hypotheses are statistically significant.
  • Confidence Intervals: These provide a range of values within which we’re confident a population parameter lies based on sample data.
  • Regression Analysis: Inferential statistics also encompass techniques like regression analysis to model relationships between variables and predict outcomes.

Now we will look at descriptive statistics in detail.

Types of Descriptive Statistics

There are various dimensions in which this data can be described. The three main dimensions used for describing data are the central tendency, dispersion, and the shape of the data. Now, let's look at them in detail, one by one.

Descriptive Statistics Based on the Central Tendency of Data

The central tendency of data is the center of the distribution of data. It describes the location of the data and concentrates on where the data is located. The three most widely used measures of the "center" of the data are the Mean, Median, and Mode.

Mean

The "Mean" is the average of the data. The average can be identified by summing up all the numbers and then dividing them by the number of observations.

Mean = (X₁ + X₂ + X₃ + … + Xₙ) / n

Data – 10, 20, 30, 40, 50; number of observations = 5
Mean = (10 + 20 + 30 + 40 + 50) / 5 = 30

The central tendency of the data may be influenced by outliers. You may now ask, ‘ What are outliers? ‘ Well, outliers are extreme behaviors. An outlier is a data point that differs significantly from other observations. It can cause serious problems in analysis.

Data – 10, 20, 30, 40, 200
Mean = (10 + 20 + 30 + 40 + 200) / 5 = 60

Solution for the outlier problem: investigating outliers and removing genuine errors before taking the average, or using a robust measure such as the median, gives more representative results.

Median

The median is the 50th percentile of the data. In other words, it is exactly the center point of the data. The median can be identified by ordering the data, splitting it into two equal parts, and then finding the number in the middle. It is a robust way to find the center of the data.

Note that, in this case, the central tendency of the data is not affected by outliers.

Odd number of data – 10, 20, 30, 40, 50: the median is 30.

Even number of data – 10, 20, 30, 40, 50, 60: find the two middle values and take their mean. Here, 30 and 40 are the middle values: (30 + 40) / 2 = 35, so the median is 35.

Mode

The mode of the data is the most frequently occurring value or element in a dataset. If an element occurs the highest number of times, it is the mode of that data. If no number in the data is repeated, then that data has no mode. There can be more than one mode in a dataset if two values share the same highest frequency.

Outliers don’t influence the data in this case. The mode can be calculated for both quantitative and qualitative data.

Data – 1, 3, 4, 6, 7, 3, 3, 5, 10, 3
The mode is 3, because 3 has the highest frequency (it occurs 4 times).

Descriptive Statistics Based on the Dispersion of Data

The dispersion is the "spread of the data": it measures how far the data values are spread out. In some datasets, the values are located closely around the mean; in others, they are widely spread out from it. Dispersion can be measured by the interquartile range (IQR), range, standard deviation, and variance of the data.

Let us see these measures in detail.

Inter Quartile Range (IQR)

Quartiles are special percentiles. The 1st quartile, Q1, is the same as the 25th percentile; the 2nd quartile, Q2, is the same as the 50th percentile (the median); and the 3rd quartile, Q3, is the same as the 75th percentile.

Steps to find quartile and percentile

  • The data should be sorted and ordered from the smallest to the largest value.
  • For Quartiles, ordered data is divided into 4 equal parts.
  • For Percentiles, ordered data is divided into 100 equal parts.

The Inter Quartile Range is the difference between the third quartile (Q3) and the first quartile (Q1)

IQR = Q3 – Q1

The interquartile range is the spread of the middle half (50%) of the data.

Range

The range is the difference between the largest and the smallest value in the data.

Standard Deviation

The most common measure of spread is the standard deviation. The standard deviation measures how far the data deviates from the mean value. The formula differs for a population and for a sample drawn from it; the two formulas are similar but not the same.

Symbol used for the sample standard deviation – "s" (lowercase)
Symbol used for the population standard deviation – "σ" (sigma, lowercase)

Steps to find the Standard Deviation

If x is a number, then the difference “x – mean” is its deviation. The deviations are used to calculate the standard deviation.

Sample standard deviation: s = √[ Σ(x − x̄)² / (n − 1) ], where x̄ is the sample mean and n is the number of observations.

Population standard deviation: σ = √[ Σ(x − μ)² / N ], where μ is the population mean and N is the size of the population.

The standard deviation is always positive or zero. It will be large when the data values are spread out from the mean.
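
The practical difference between the two formulas is only the divisor (N for a population, n − 1 for a sample), as a quick NumPy sketch with invented data shows:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

sigma = np.std(data, ddof=0)  # population standard deviation: divide by N
s = np.std(data, ddof=1)      # sample standard deviation: divide by n − 1

print(sigma, s)  # ~14.14 ~15.81
```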

Variance

The variance is a measure of variability. It is the average squared deviation from the mean. The symbol σ² represents the population variance, and s² represents the sample variance.

Descriptive Statistics Based on the Shape of the Data

The shape of the data is important because it indicates how probable different values are and which statistical methods are appropriate; the shape describes the overall form of the distribution's graph. The shape of the data can be characterized in three ways: symmetry, skewness, and kurtosis.

Symmetric

In a symmetric distribution, the data is distributed the same on both sides of the center, and the mean and median are located close together. The curve formed by such a symmetric graph is called a normal curve.

Skewness

Skewness is the measure of the asymmetry of the distribution of data: the data is not symmetrical but skewed towards one side. Skewness is classified into two types: positive skew and negative skew.

  • Positively skewed : In a Positively skewed distribution, the data values are clustered around the left side of the distribution, and the right side is longer. The mean and median will be greater than the mode in the positive skew.
  • Negatively skewed : In a Negatively skewed distribution, the data values are clustered around the right side of the distribution, and the left side is longer. The mean and median will be less than the mode.


Kurtosis

Kurtosis measures the tailedness of the distribution of data. Based on kurtosis, distributions are classified into three types: platykurtic, mesokurtic, and leptokurtic.

  • Platykurtic : A platykurtic distribution has light, flat tails; the data is spread fairly evenly, and the flat tails indicate few extreme values (outliers).
  • Mesokurtic : A mesokurtic distribution has tails that match those of a normal distribution; normally distributed data is mesokurtic.
  • Leptokurtic : A leptokurtic distribution is very closely concentrated around the center, with a peak that is taller and narrower than a normal distribution and heavier tails.

Univariate Data vs. Bivariate Data in Descriptive Statistics

When it comes to delving into the world of data analysis, two key terms you're likely to encounter are "Univariate" and "Bivariate." These terms are crucial in descriptive statistics, as they help us categorize and understand the data types we're working with. Whether you're deciphering the properties of individual data points or unraveling the intricate dance between two variables, the concepts of univariate and bivariate data provide the foundation for insightful data analysis.

The key difference between univariate and bivariate data lies in the focus of analysis: univariate analysis centers on understanding the characteristics of a single variable, while bivariate analysis explores connections and interactions between two variables. Let's break down the differences between the two to better grasp their significance.

Univariate Data

Univariate data focuses on a single variable, essentially spotlighting one aspect of your data. In this scenario, you’re interested in studying the distribution, central tendency, and dispersion of a single set of values. For instance, if you’re analyzing the heights of a group of individuals, you’re dealing with univariate data. Here, the variable of interest is height, and you aim to uncover insights about that specific characteristic.

In univariate analysis, you’re often looking at measures like:

  • Measures of Central Tendency: Mean, median, and mode provide insights into where the center of the data lies.
  • Measures of Dispersion: Range, variance, and standard deviation help you understand how spread out the data is.
  • Frequency Distribution: Creating histograms, bar charts, and pie charts allows you to visualize the data’s distribution.

Bivariate Data

Bivariate data, on the other hand, adds an extra layer of complexity to your analysis by involving two variables. Here, you’re not just interested in understanding individual characteristics; you’re also keen on uncovering relationships and patterns between two different variables. For example, if you’re examining the relationship between hours of study and exam scores, you’re working with bivariate data. The goal is to determine whether changes in one variable (study hours) have an impact on another (exam scores).

Bivariate analysis often involves techniques such as:

  • Scatter Plots: These visualizations showcase the relationship between two variables, with each data point plotted on the graph.
  • Correlation: Calculating correlation coefficients helps you quantify the strength and direction of the relationship between variables.
  • Regression Analysis: This technique allows you to model the relationship between variables, predicting the outcome of one based on the other.

What Are the 10 Commonly Used Descriptive Statistics?

There are many useful descriptive statistics and related concepts, but here are 10 of the most commonly used:

  • Mean : This is the average of all the values in a data set. It’s a good indicator of the overall center of the data, but can be sensitive to outliers, especially in multivariate data with extreme values.
  • Median : This is the ‘middle’ value when the data is ordered from least to greatest. It’s less affected by outliers than the mean, making it a robust measure for box plot analyses.
  • Mode : This is the most frequent value in a data set. There can be one mode, or even multiple modes in some cases, especially when dealing with categorical variables.
  • Standard Deviation : This tells you how spread out the data is from the mean. A larger standard deviation indicates a wider spread of data points. It’s crucial in understanding the dispersion in multivariate data.
  • Range : This is the difference between the highest and lowest values in the data set. It’s a simple way to gauge how much variation there is but doesn’t tell you anything about the distribution within that range. It’s often represented in graphical representations like box plots.
  • Categorical Variables : These are variables that represent distinct groups or categories. Analysis often involves graphical representations and contingency tables to understand the relationships between categories.
  • Contingency Tables : These tables are used to display the frequency distribution of categorical variables. They help in analyzing the relationship between different categorical variables in multivariate data.
  • Box Plot : A graphical representation that shows the distribution of a dataset through its quartiles. It highlights the median, quartiles, and extreme values, providing a clear picture of the data’s spread and potential outliers.
  • Graphical Representation : This involves using visual tools like box plots, histograms, and scatter plots to summarize and analyze data, making it easier to identify patterns, trends, and extreme values in both univariate and multivariate datasets.
  • Extreme Values : These are the data points that are significantly higher or lower than the majority of the data. They can heavily influence the mean and standard deviation and are often highlighted in box plots and other graphical representations.

Can Descriptive Statistics Be Used to Make Inferences or Predictions?

Descriptive statistics themselves are not used for predictions, but they can lay the groundwork for them. Here's the key difference:

Descriptive statistics summarize the data you have. They use measures like mean, median, and standard deviation to give you a general idea of what the data looks like. This process often involves exploratory data analysis, where open exploration of the data can reveal patterns and insights. For instance, calculating mean scores is a common part of this analysis.

Inferential statistics use the data you have to draw conclusions about a larger population. This allows you to make predictions about things you haven’t observed yet. Here, you would identify the dependent variable and independent variable in your study, which are crucial for making these inferences.

Think of it like this: Descriptive statistics describe your apartment, while inferential statistics use the features of your apartment to guess about the entire apartment building.

So, while descriptive statistics can’t directly predict the future, they can help you understand the data and prepare it for inferential statistics, which can then be used for predictions. Summary statistics from your exploratory data analysis can provide the foundation for these predictive models.

In a world flooded with data, understanding, interpreting, and communicating information is paramount. Descriptive statistics doesn’t just crunch numbers; it crafts narratives, constructs visualizations, and empowers us to make informed decisions. Hope this article has given you a brief introduction to descriptive statistics. In this article, we have seen how the various measures of descriptive statistics, such as central tendency, dispersion, and shape of the data curve, help decipher the numbers. We have also bridged the gap between individual characteristics and the dance between variables by learning about univariate and bivariate data.


Frequently Asked Questions

Q1. What are descriptive statistics?

Ans. The methods used to summarize and describe the main features of a dataset are called descriptive statistics. Measures of central tendency, measures of variability, etc., which give information about the typical values in a dataset, are all examples of descriptive statistics.

Q2. What are the 5 descriptive statistics?

Ans. The 5 descriptive statistics include standard deviation, minimum and maximum variables, variance, kurtosis, and skewness.

Q3. What are the 3 main types of descriptive statistics?

Ans. The frequency distribution, central tendency, and variability of a dataset are the 3 main types of descriptive statistics.

Q4. How are descriptive statistics classified?

Ans. Descriptive statistics are of 3 types: frequency distribution, central tendency, and variability.


Descriptive Statistics for Summarising Data

Ray W. Cooksey

UNE Business School, University of New England, Armidale, NSW Australia

This chapter discusses and illustrates descriptive statistics . The purpose of the procedures and fundamental concepts reviewed in this chapter is quite straightforward: to facilitate the description and summarisation of data. By ‘describe’ we generally mean either the use of some pictorial or graphical representation of the data (e.g. a histogram, box plot, radar plot, stem-and-leaf display, icon plot or line graph) or the computation of an index or number designed to summarise a specific characteristic of a variable or measurement (e.g., frequency counts, measures of central tendency, variability, standard scores). Along the way, we explore the fundamental concepts of probability and the normal distribution. We seldom interpret individual data points or observations primarily because it is too difficult for the human brain to extract or identify the essential nature, patterns, or trends evident in the data, particularly if the sample is large. Rather we utilise procedures and measures which provide a general depiction of how the data are behaving. These statistical procedures are designed to identify or display specific patterns or trends in the data. What remains after their application is simply for us to interpret and tell the story.

The first broad category of statistics we discuss concerns descriptive statistics . The purpose of the procedures and fundamental concepts in this category is quite straightforward: to facilitate the description and summarisation of data. By ‘describe’ we generally mean either the use of some pictorial or graphical representation of the data or the computation of an index or number designed to summarise a specific characteristic of a variable or measurement.

We seldom interpret individual data points or observations primarily because it is too difficult for the human brain to extract or identify the essential nature, patterns, or trends evident in the data, particularly if the sample is large. Rather we utilise procedures and measures which provide a general depiction of how the data are behaving. These statistical procedures are designed to identify or display specific patterns or trends in the data. What remains after their application is simply for us to interpret and tell the story.

Reflect on the QCI research scenario and the associated data set discussed in Chap. 10.1007/978-981-15-2537-7_4. Consider the following questions that Maree might wish to address with respect to decision accuracy and speed scores:

  • What was the typical level of accuracy and decision speed for inspectors in the sample? [see Procedure 5.4 – Assessing central tendency.]
  • What was the most common accuracy and speed score amongst the inspectors? [see Procedure 5.4 – Assessing central tendency.]
  • What was the range of accuracy and speed scores; the lowest and the highest scores? [see Procedure 5.5 – Assessing variability.]
  • How frequently were different levels of inspection accuracy and speed observed? What was the shape of the distribution of inspection accuracy and speed scores? [see Procedure 5.1 – Frequency tabulation, distributions & crosstabulation.]
  • What percentage of inspectors would have ‘failed’ to ‘make the cut’ assuming the industry standard for acceptable inspection accuracy and speed combined was set at 95%? [see Procedure 5.7 – Standard ( z ) scores.]
  • How variable were the inspectors in their accuracy and speed scores? Were all the accuracy and speed levels relatively close to each other in magnitude or were the scores widely spread out over the range of possible test outcomes? [see Procedure 5.5 – Assessing variability.]
  • What patterns might be visually detected when looking at various QCI variables singly and together as a set? [see Procedure 5.2 – Graphical methods for displaying data, Procedure 5.3 – Multivariate graphs & displays, and Procedure 5.6 – Exploratory data analysis.]

This chapter includes discussions and illustrations of a number of procedures available for answering questions about data like those posed above. In addition, you will find discussions of two fundamental concepts, namely probability and the normal distribution ; concepts that provide building blocks for Chaps. 10.1007/978-981-15-2537-7_6 and 10.1007/978-981-15-2537-7_7.

Procedure 5.1: Frequency Tabulation, Distributions & Crosstabulation

Frequency tabulation and distributions.

Frequency tabulation serves to provide a convenient counting summary for a set of data that facilitates interpretation of various aspects of those data. Basically, frequency tabulation occurs in two stages:

  • First, the scores in a set of data are rank ordered from the lowest value to the highest value.
  • Second, the number of times each specific score occurs in the sample is counted. This count records the frequency of occurrence for that specific data value.

Consider the overall job satisfaction variable, jobsat, from the QCI data scenario. Performing frequency tabulation across the 112 Quality Control Inspectors on this variable using the SPSS Frequencies procedure (Allen et al. 2019, ch. 3; George and Mallery 2019, ch. 6) produces the frequency tabulation shown in Table 5.1. Note that three of the inspectors in the sample did not provide a rating for jobsat, thereby producing three missing values (= 2.7% of the sample of 112) and leaving 109 inspectors with valid data for the analysis.

Table 5.1 Frequency tabulation of overall job satisfaction scores

The display of frequency tabulation is often referred to as the frequency distribution for the sample of scores. For each value of a variable, the frequency of its occurrence in the sample of data is reported. It is possible to compute various percentages and percentile values from a frequency distribution.

Table 5.1 shows the ‘Percent’ or relative frequency of each score (the percentage of the 112 inspectors obtaining each score, including those inspectors who were missing scores, which SPSS labels as ‘System’ missing). Table 5.1 also shows the ‘Valid Percent’ which is computed only for those inspectors in the sample who gave a valid or non-missing response.

Finally, it is possible to add up the ‘Valid Percent’ values, starting at the low score end of the distribution, to form the cumulative distribution or ‘Cumulative Percent’ . A cumulative distribution is useful for finding percentiles which reflect what percentage of the sample scored at a specific value or below.
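
If you would like to see the underlying arithmetic laid bare outside of a menu-driven package, the following minimal Python (pandas) sketch reproduces the Frequency, Percent, Valid Percent and Cumulative Percent columns of a table like Table 5.1. Note that the rating values used here are invented stand-ins, not the actual QCI scores.

import numpy as np
import pandas as pd

# Hypothetical 1-to-7 job satisfaction ratings; np.nan marks a missing response
jobsat = pd.Series([4, 7, 3, np.nan, 5, 7, 2, 6, 4, np.nan, 1, 5])

counts = jobsat.value_counts().sort_index()      # frequency of each distinct score, low to high
percent = counts / len(jobsat) * 100             # 'Percent': base includes missing responses
valid_percent = counts / jobsat.count() * 100    # 'Valid Percent': base excludes missing responses
cum_percent = valid_percent.cumsum()             # 'Cumulative Percent', accumulated from the low score end

print(pd.DataFrame({'Frequency': counts,
                    'Percent': percent.round(1),
                    'Valid Percent': valid_percent.round(1),
                    'Cumulative Percent': cum_percent.round(1)}))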

We can see in Table 5.1 that 4 of the 109 valid inspectors (a ‘Valid Percent’ of 3.7%) indicated the lowest possible level of job satisfaction, a value of 1 (Very Low), whereas 18 of the 109 valid inspectors (a ‘Valid Percent’ of 16.5%) indicated the highest possible level of job satisfaction, a value of 7 (Very High). The ‘Cumulative Percent’ number of 18.3 in the row for the job satisfaction score of 3 can be interpreted as “roughly 18% of the sample of inspectors reported a job satisfaction score of 3 or less”; that is, nearly a fifth of the sample expressed some degree of negative satisfaction with their job as a quality control inspector in their particular company.

If you have a large data set having many different scores for a particular variable, it may be more useful to tabulate frequencies on the basis of intervals of scores.

For the accuracy scores in the QCI database, you could count scores occurring in intervals such as ‘less than 75% accuracy’, ‘75% to less than 85% accuracy’, ‘85% to less than 95% accuracy’, and ‘95% accuracy or greater’, rather than counting the individual scores themselves. This would yield what is termed a ‘grouped’ frequency distribution, since the data have been grouped into intervals or score classes. Producing such an analysis using SPSS would involve extra steps to create the new category or ‘grouping’ system for scores prior to conducting the frequency tabulation, as the sketch below illustrates.
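
As a rough sketch of those extra grouping steps in code rather than menus, the cut function in Python’s pandas library can create the category system directly; the accuracy values below are randomly generated stand-ins for the real QCI data.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
accuracy = pd.Series(rng.normal(82, 9, 112)).clip(55, 100)   # hypothetical accuracy percentages

# right=False makes each interval closed on the left: [0, 75), [75, 85), [85, 95), [95, 101)
bins = [0, 75, 85, 95, 101]
labels = ['less than 75%', '75% to <85%', '85% to <95%', '95% or greater']
grouped = pd.cut(accuracy, bins=bins, labels=labels, right=False)

print(grouped.value_counts(sort=False))   # the grouped frequency distribution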

Crosstabulation

In a frequency crosstabulation , we count frequencies on the basis of two variables simultaneously rather than one; thus we have a bivariate situation.

For example, Maree might be interested in the number of male and female inspectors in the sample of 112 who obtained each jobsat score. Here there are two variables to consider: inspector’s gender and inspector’s jobsat score. Table 5.2 shows such a crosstabulation as compiled by the SPSS Crosstabs procedure (George and Mallery 2019, ch. 8). Note that inspectors who did not report a score for jobsat and/or gender have been omitted as missing values, leaving 106 valid inspectors for the analysis.

Table 5.2 Frequency crosstabulation of jobsat scores by gender category for the QCI data

The crosstabulation shown in Table 5.2 gives a composite picture of the distribution of satisfaction levels for male inspectors and for female inspectors. If frequencies or ‘Counts’ are added across the gender categories, we obtain the numbers in the ‘Total’ column (the percentages or relative frequencies are also shown immediately below each count) for each discrete value of jobsat (note this column of statistics differs from that in Table 5.1 because the gender variable was missing for certain inspectors). By adding down each gender column, we obtain, in the bottom row labelled ‘Total’, the number of males and the number of females that comprised the sample of 106 valid inspectors.

The totals, either across the rows or down the columns of the crosstabulation, are termed the marginal distributions of the table. These marginal distributions are equivalent to frequency tabulations for each of the variables jobsat and gender . As with frequency tabulation, various percentage measures can be computed in a crosstabulation, including the percentage of the sample associated with a specific count within either a row (‘% within jobsat ’) or a column (‘% within gender ’). You can see in Table 5.2 that 18 inspectors indicated a job satisfaction level of 7 (Very High); of these 18 inspectors reported in the ‘Total’ column, 8 (44.4%) were male and 10 (55.6%) were female. The marginal distribution for gender in the ‘Total’ row shows that 57 inspectors (53.8% of the 106 valid inspectors) were male and 49 inspectors (46.2%) were female. Of the 57 male inspectors in the sample, 8 (14.0%) indicated a job satisfaction level of 7 (Very High). Furthermore, we could generate some additional interpretive information of value by adding the ‘% within gender’ values for job satisfaction levels of 5, 6 and 7 (i.e. differing degrees of positive job satisfaction). Here we would find that 68.4% (= 24.6% + 29.8% + 14.0%) of male inspectors indicated some degree of positive job satisfaction compared to 61.2% (= 10.2% + 30.6% + 20.4%) of female inspectors.

This helps to build a picture of the possible relationship between an inspector’s gender and their level of job satisfaction (a relationship that, as we will see later, can be quantified and tested using procedures described in Chaps. 6 and 7).
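
A table of this kind can also be sketched in Python using the pandas crosstab function, including the marginal ‘Total’ row and column and the ‘% within gender’ percentages. The data below are randomly generated stand-ins, so the counts will not match Table 5.2.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({'gender': rng.choice(['Male', 'Female'], size=106),   # hypothetical data
                   'jobsat': rng.integers(1, 8, size=106)})              # 1-to-7 ratings

# Counts plus the marginal distributions as a 'Total' row and column
print(pd.crosstab(df['jobsat'], df['gender'], margins=True, margins_name='Total'))

# '% within gender': percentages computed down each gender column (each column sums to 100)
print((pd.crosstab(df['jobsat'], df['gender'], normalize='columns') * 100).round(1))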

It should be noted that a crosstabulation table such as that shown in Table 5.2 is often referred to as a contingency table, about which more will be said later (see the relevant procedures in Chap. 7).

Advantages

Frequency tabulation is useful for providing convenient data summaries which can aid in interpreting trends in a sample, particularly where the number of discrete values for a variable is relatively small. A cumulative percent distribution provides additional interpretive information about the relative positioning of specific scores within the overall distribution for the sample.

Crosstabulation permits the simultaneous examination of the distributions of values for two variables obtained from the same sample of observations. This examination can yield some useful information about the possible relationship between the two variables. More complex crosstabulations can also be done where the values of three or more variables are tracked in a single systematic summary. The use of frequency tabulation or crosstabulation in conjunction with various other statistical measures, such as measures of central tendency (see Procedure 5.4) and measures of variability (see Procedure 5.5), can provide a relatively complete descriptive summary of any data set.

Disadvantages

Frequency tabulations can get messy if interval or ratio-level measures are tabulated, simply because of the large number of possible data values. Grouped frequency distributions really should be used in such cases. However, certain choices, such as the size of the score interval (group size), must be made, often arbitrarily, and such choices can affect the nature of the final frequency distribution.

Additionally, percentage measures have certain problems associated with them, most notably, the potential for their misinterpretation in small samples. One should be sure to know the sample size on which percentage measures are based in order to obtain an interpretive reference point for the actual percentage values.

For example

In a sample of 10 individuals, 20% represents only two individuals whereas in a sample of 300 individuals, 20% represents 60 individuals. If all that is reported is the 20%, then the mental inference drawn by readers is likely to be that a sizeable number of individuals had a score or scores of a particular value—but what is ‘sizeable’ depends upon the total number of observations on which the percentage is based.

Where Is This Procedure Useful?

Frequency tabulation and crosstabulation are very commonly applied procedures used to summarise information from questionnaires, both in terms of tabulating various demographic characteristics (e.g. gender, age, education level, occupation) and in terms of actual responses to questions (e.g. numbers responding ‘yes’ or ‘no’ to a particular question). They can be particularly useful in helping to build up the data screening and demographic stories discussed in Chap. 4. Categorical data from observational studies can also be analysed with this technique (e.g. the number of times Suzy talks to Frank, to Billy, and to John in a study of children’s social interactions).

Certain types of experimental research designs may also be amenable to analysis by crosstabulation with a view to drawing inferences about distribution differences across the sets of categories for the two variables being tracked.

You could employ crosstabulation in conjunction with the tests described in Chap. 7 to see if two different styles of advertising campaign differentially affect the product purchasing patterns of male and female consumers.

In the QCI database, Maree could employ crosstabulation to help her answer the question “do different types of electronic manufacturing firms ( company ) differ in terms of their tendency to employ male versus female quality control inspectors ( gender )?”

Software Procedures

SPSS: Analyze → Descriptive Statistics → Frequencies… or Crosstabs… and select the variable(s) you wish to analyse; for the Crosstabs procedure, hitting the ‘Cells’ button will allow you to choose various types of statistics and percentages to show in each cell of the table.
NCSS: open the frequency tabulation or crosstabulation procedure and select the variable(s) you wish to analyse.
SYSTAT: open the relevant tabulation procedure and select the variable(s) you wish to analyse and choose the optional statistics you wish to see.
STATGRAPHICS: open the relevant tabulation procedure and select the variable(s) you wish to analyse; when the ‘Tables and Graphs’ window opens, choose the tables and graphs you wish to see.
R Commander: open the relevant summaries procedure and select the variable(s) you wish to analyse and choose the optional statistics you wish to see.

Procedure 5.2: Graphical Methods for Displaying Data

Graphical methods for displaying data include bar and pie charts, histograms and frequency polygons, line graphs and scatterplots. It is important to note that what is presented here is a small but representative sampling of the types of simple graphs one can produce to summarise and display trends in data. Generally speaking, SPSS offers the easiest facility for producing and editing graphs, but with a rather limited range of styles and types. SYSTAT, STATGRAPHICS and NCSS offer a much wider range of graphs (including graphs unique to each package), but with the drawback that it takes somewhat more effort to get the graphs in exactly the form you want.

Bar and Pie Charts

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are categorical (nominal or ordinal level of measurement).

  • A bar chart uses vertical and horizontal axes to summarise the data. The vertical axis is used to represent frequency (number) of occurrence or the relative frequency (percentage) of occurrence; the horizontal axis is used to indicate the data categories of interest.
  • A pie chart gives a simpler visual representation of category frequencies by cutting a circular plot into wedges or slices whose sizes are proportional to the relative frequency (percentage) of occurrence of specific data categories. Some pie charts can have one or more slices emphasised by ‘exploding’ them out from the rest of the pie.

Consider the company variable from the QCI database. This variable depicts the types of manufacturing firms that the quality control inspectors worked for. Figure 5.1 illustrates a bar chart summarising the percentage of female inspectors in the sample coming from each type of firm. Figure 5.2 shows a pie chart representation of the same data, with an ‘exploded slice’ highlighting the percentage of female inspectors in the sample who worked for large business computer manufacturers – the lowest percentage of the five types of companies. Both graphs were produced using SPSS.
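
For readers working outside SPSS, a minimal matplotlib sketch of the same two chart types follows; the company labels and percentages are invented for illustration and are not the QCI figures.

import matplotlib.pyplot as plt

# Hypothetical percentages of female inspectors drawn from each company type
companies = ['PC', 'Large computer', 'Small appliance', 'Large appliance', 'Automobile']
pct_female = [20, 12, 24, 22, 22]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(companies, pct_female)                 # bar chart: categories along the horizontal axis
ax1.set_ylabel('% of female inspectors')
ax1.tick_params(axis='x', rotation=45)

# 'Explode' the smallest slice out from the rest of the pie
explode = [0.1 if p == min(pct_female) else 0.0 for p in pct_female]
ax2.pie(pct_female, labels=companies, explode=explode, autopct='%1.1f%%')

plt.tight_layout()
plt.show()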

Fig. 5.1 Bar chart: Percentage of female inspectors

Fig. 5.2 Pie chart: Percentage of female inspectors

The pie chart was modified with an option to show the actual percentage along with the label for each category. The bar chart shows that computer manufacturing firms have relatively fewer female inspectors compared to the automotive and electrical appliance (large and small) firms. This trend is less clear from the pie chart, which suggests that pie charts may be less visually interpretable when the data categories occur with rather similar frequencies. However, the ‘exploded slice’ option can help interpretation in some circumstances.

Certain software programs, such as SPSS, STATGRAPHICS, NCSS and Microsoft Excel, offer the option of generating 3-dimensional bar charts and pie charts and incorporating other ‘bells and whistles’ that can potentially add visual richness to the graphic representation of the data. However, you should generally be careful with these fancier options as they can produce distortions and create ambiguities in interpretation (e.g. see discussions in Jacoby 1997 ; Smithson 2000 ; Wilkinson 2009 ). Such distortions and ambiguities could ultimately end up providing misinformation to researchers as well as to those who read their research.

Histograms and Frequency Polygons

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are essentially continuous (interval or ratio level of measurement) in nature. Both histograms and frequency polygons use vertical and horizontal axes to summarise the data. The vertical axis is used to represent the frequency (number) of occurrence or the relative frequency (percentage) of occurrences; the horizontal axis is used for the data values or ranges of values of interest. The histogram uses bars of varying heights to depict frequency; the frequency polygon uses lines and points.

There is a visual difference between a histogram and a bar chart: the bar chart uses bars that do not physically touch, signifying the discrete and categorical nature of the data, whereas the bars in a histogram physically touch to signal the potentially continuous nature of the data.

Suppose Maree wanted to graphically summarise the distribution of speed scores for the 112 inspectors in the QCI database. Figure 5.3 (produced using NCSS) illustrates a histogram representation of this variable. Figure 5.3 also illustrates another representational device called the ‘density plot’ (the solid tracing line overlaying the histogram) which gives a smoothed impression of the overall shape of the distribution of speed scores. Figure 5.4 (produced using STATGRAPHICS) illustrates the frequency polygon representation for the same data.
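
A histogram with an overlaid density tracing line, in the spirit of Fig. 5.3, can be sketched in Python as follows; the speed values are randomly generated to be positively skewed and are not the real QCI scores.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
speed = rng.lognormal(mean=1.3, sigma=0.5, size=112)   # hypothetical positively skewed speeds

fig, ax = plt.subplots()
ax.hist(speed, bins=15, density=True, alpha=0.6, edgecolor='black')

xs = np.linspace(speed.min(), speed.max(), 200)
ax.plot(xs, gaussian_kde(speed)(xs))                   # the smoothed density tracing line
ax.set_xlabel('speed (seconds)')
ax.set_ylabel('density')
plt.show()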

Fig. 5.3 Histogram of the speed variable (with density plot overlaid)

Fig. 5.4 Frequency polygon plot of the speed variable

These graphs employ a grouped format where speed scores which fall within specific intervals are counted as being essentially the same score. The shape of the data distribution is reflected in these plots. Each graph tells us that the inspection speed scores are positively skewed with only a few inspectors taking very long times to make their inspection judgments and the majority of inspectors taking rather shorter amounts of time to make their decisions.

Both representations tell a similar story; the choice between them is largely a matter of personal preference. However, if the number of bars to be plotted in a histogram is potentially very large (and this is usually directly controllable in most statistical software packages), then a frequency polygon would be the preferred representation simply because the amount of visual clutter in the graph will be much reduced.

It is somewhat of an art to choose an appropriate definition for the width of the score grouping intervals (or ‘bins’ as they are often termed) to be used in the plot: choose too many and the plot may look too lumpy and the overall distributional trend may not be obvious; choose too few and the plot will be too coarse to give a useful depiction. Programs like SPSS, SYSTAT, STATGRAPHICS and NCSS are designed to choose an ‘appropriate’ number of bins to be used, but the analyst’s eye is often a better judge than any statistical rule that a software package would use.

There are several interesting variations of the histogram which can highlight key data features or facilitate interpretation of certain trends in the data. One such variation is a graph called the dual histogram (available in SYSTAT; a variation called a ‘comparative histogram’ can be created in NCSS) – a graph that facilitates visual comparison of the frequency distributions for a specific variable for participants from two distinct groups.

Suppose Maree wanted to graphically compare the distributions of speed scores for inspectors in the two categories of education level (educlev) in the QCI database. Figure 5.5 shows a dual histogram (produced using SYSTAT) that accomplishes this goal. This graph still employs the grouped format where speed scores falling within particular intervals are counted as being essentially the same score. The shape of the data distribution within each group is also clearly reflected in this plot. However, the dual histogram suggests that, while the inspection speed scores are positively skewed for inspectors in both categories of educlev, inspectors with a high school level of education (= 1) tend to take slightly longer to make their inspection decisions than do their colleagues who have a tertiary qualification (= 2).

Fig. 5.5 Dual histogram of speed for the two categories of educlev

Line Graphs

The line graph is similar in style to the frequency polygon but is much more general in its potential for summarising data. In a line graph, we seldom deal with percentage or frequency data. Instead we can summarise other types of information about data such as averages or means (see Procedure 5.4 for a discussion of this measure), often for different groups of participants. Thus, one important use of the line graph is to break down scores on a specific variable according to membership in the categories of a second variable.

In the context of the QCI database, Maree might wish to summarise the average inspection accuracy scores for the inspectors from different types of manufacturing companies. Figure 5.6 was produced using SPSS and shows such a line graph.

Fig. 5.6 Line graph comparison of companies in terms of average inspection accuracy

Note how the trend in performance across the different companies becomes clearer with such a visual representation. It appears that the inspectors from the Large Business Computer and PC manufacturing companies have better average inspection accuracy compared to the inspectors from the remaining three industries.

With many software packages, it is possible to further elaborate a line graph by including error or confidence interval bars (see the relevant procedure in Chap. 8). These give some indication of the precision with which the average level for each category in the population has been estimated (narrow bars signal a more precise estimate; wide bars signal a less precise estimate).

Figure 5.7 shows such an elaborated line graph, using 95% confidence interval bars, which can be used to help make more defensible judgments (compared to Fig. 5.6 ) about whether the companies are substantively different from each other in average inspection performance. Companies whose confidence interval bars do not overlap each other can be inferred to be substantively different in performance characteristics.
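
The logic of Fig. 5.7 can be sketched in Python by plotting each group mean with an approximate 95% confidence half-width of 1.96 times the standard error; the means, standard deviations and group sizes below are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical summary statistics per company type
companies = ['PC', 'Large computer', 'Small appliance', 'Large appliance', 'Automobile']
means = np.array([88.0, 91.0, 81.0, 78.0, 74.0])
sds = np.array([7.0, 6.0, 8.0, 9.0, 8.5])
ns = np.array([22, 21, 23, 20, 20])

ci95 = 1.96 * sds / np.sqrt(ns)   # approximate 95% confidence interval half-widths

fig, ax = plt.subplots()
ax.errorbar(companies, means, yerr=ci95, fmt='o-', capsize=4)
ax.set_ylabel('mean inspection accuracy (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()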

Fig. 5.7 Line graph using confidence interval bars to compare accuracy across companies

The accuracy confidence interval bars for participants from the Large Business Computer manufacturing firms do not overlap those from the Large or Small Electrical Appliance manufacturers or the Automobile manufacturers.

We might conclude that quality control inspection accuracy is substantially better in the Large Business Computer manufacturing companies than in these other industries but is not substantially better than the PC manufacturing companies. We might also conclude that inspection accuracy in PC manufacturing companies is not substantially different from Small Electrical Appliance manufacturers.

Scatterplots

Scatterplots are useful in displaying the relationship between two interval- or ratio-scaled variables or measures of interest obtained on the same individuals, particularly in correlational research (see the relevant fundamental concept and procedure in Chap. 6).

In a scatterplot, one variable is chosen to be represented on the horizontal axis; the second variable is represented on the vertical axis. In this type of plot, all data point pairs in the sample are graphed. The shape and tilt of the cloud of points in a scatterplot provide visual information about the strength and direction of the relationship between the two variables. A very compact elliptical cloud of points signals a strong relationship; a very loose or nearly circular cloud signals a weak or non-existent relationship. A cloud of points generally tilted upward toward the right side of the graph signals a positive relationship (higher scores on one variable associated with higher scores on the other and vice-versa). A cloud of points generally tilted downward toward the right side of the graph signals a negative relationship (higher scores on one variable associated with lower scores on the other and vice-versa).

Maree might be interested in displaying the relationship between inspection accuracy and inspection speed in the QCI database. Figure 5.8, produced using SPSS, shows what such a scatterplot might look like. Several characteristics of the data for these two variables can be noted in Fig. 5.8. The shape of the distribution of data points is evident. The plot has a fan-shaped characteristic to it which indicates that accuracy scores are highly variable (exhibit a very wide range of possible scores) at very fast inspection speeds but get much less variable and tend to be somewhat higher as inspection speed increases (where inspectors take longer to make their quality control decisions). Thus, there does appear to be some relationship between inspection accuracy and inspection speed (a weak positive relationship, since the cloud of points tends to be very loose but tilted generally upward toward the right side of the graph; slower speeds tend to be slightly associated with higher accuracy).

Fig. 5.8 Scatterplot relating inspection accuracy to inspection speed

However, it is not the case that the inspection decisions which take longest to make are necessarily the most accurate (see the labelled points for inspectors 7 and 62 in Fig. 5.8 ). Thus, Fig. 5.8 does not show a simple relationship that can be unambiguously summarised by a statement like “the longer an inspector takes to make a quality control decision, the more accurate that decision is likely to be”. The story is more complicated.

Some software packages, such as SPSS, STATGRAPHICS and SYSTAT, offer the option of using different plotting symbols or markers to represent the members of different groups so that the relationship between the two focal variables (the ones anchoring the X and Y axes) can be clarified with reference to a third categorical measure.

Maree might want to see if the relationship depicted in Fig. 5.8 changes depending upon whether the inspector was tertiary-qualified or not (this information is represented in the educlev variable of the QCI database).

Figure 5.9 shows what such a modified scatterplot might look like; the legend in the upper corner of the figure defines the marker symbols for each category of the educlev variable. Note that for both High School only-educated inspectors and Tertiary-qualified inspectors, the general fan-shaped relationship between accuracy and speed is the same. However, it appears that the distribution of points for the High School only-educated inspectors is shifted somewhat upward and toward the right of the plot suggesting that these inspectors tend to be somewhat more accurate as well as slower in their decision processes.
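
A scatterplot with group-specific markers, in the style of Fig. 5.9, can be sketched as follows; the fan-shaped data are simulated, so the pattern only mimics the QCI relationship.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

fig, ax = plt.subplots()
for label, marker in [('High School only', 'o'), ('Tertiary qualified', '^')]:
    speed = rng.uniform(1, 17, 50)                    # hypothetical decision times
    noise = rng.normal(0, 20, 50) / np.sqrt(speed)    # more scatter at faster speeds
    accuracy = np.clip(75 + speed + noise, 50, 100)
    ax.scatter(speed, accuracy, marker=marker, label=label, alpha=0.7)

ax.set_xlabel('inspection speed (seconds)')
ax.set_ylabel('inspection accuracy (%)')
ax.legend(title='educlev')
plt.show()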

Fig. 5.9 Scatterplot displaying accuracy vs speed conditional on educlev group

There are many other styles of graphs available, often dependent upon the specific statistical package you are using. Interestingly, NCSS and, particularly, SYSTAT and STATGRAPHICS, appear to offer the most variety in terms of types of graphs available for visually representing data. A reading of the user’s manuals for these programs (see the Useful additional readings) would expose you to the great diversity of plotting techniques available to researchers. Many of these techniques go by rather interesting names such as: Chernoff’s faces, radar plots, sunflower plots, violin plots, star plots, Fourier blobs, and dot plots.

Advantages

These graphical methods provide summary techniques for visually presenting certain characteristics of a set of data. Visual representations are generally easier to understand than a tabular representation and when these plots are combined with available numerical statistics, they can give a very complete picture of a sample of data. Newer methods have become available which permit more complex representations to be depicted, opening possibilities for creatively visually representing more aspects and features of the data (leading to a style of visual data storytelling called infographics; see, for example, McCandless 2014; Toseland and Toseland 2012). Many of these newer methods can display data patterns from multiple variables in the same graph (several of these newer graphical methods are illustrated and discussed in Procedure 5.3).

Disadvantages

Graphs tend to be cumbersome and space consuming if a great many variables need to be summarised. In such cases, using numerical summary statistics (such as means or correlations) in tabular form alone will provide a more economical and efficient summary. Also, it can be very easy to give a misleading picture of data trends using graphical methods by simply choosing the ‘correct’ scaling for maximum effect or choosing a display option (such as a 3-D effect) that ‘looks’ presentable but which actually obscures a clear interpretation (see Smithson 2000; Wilkinson 2009).

Thus, you must be careful in creating and interpreting visual representations so that the influence of aesthetic choices for sake of appearance do not become more important than obtaining a faithful and valid representation of the data—a very real danger with many of today’s statistical packages where ‘default’ drawing options have been pre-programmed in. No single plot can completely summarise all possible characteristics of a sample of data. Thus, choosing a specific method of graphical display may, of necessity, force a behavioural researcher to represent certain data characteristics (such as frequency) at the expense of others (such as averages).

Where Is This Procedure Useful?

Virtually any research design which produces quantitative data and statistics (even to the extent of just counting the number of occurrences of several events) provides opportunities for graphical data display which may help to clarify or illustrate important data characteristics or relationships. Remember, graphical displays are communication tools just like numbers—which tool to choose depends upon the message to be conveyed. Visual representations of data are generally more useful in communicating to lay persons who are unfamiliar with statistics. Care must be taken though as these same lay people are precisely the people most likely to misinterpret a graph if it has been incorrectly drawn or scaled.

Software Procedures

SPSS: Graphs → Chart Builder… and choose from a range of gallery chart types (bar, pie, histogram, line, scatter and so on); drag the chart type into the working area and customise the chart with desired variables, labels, etc. Many elements of a chart, including error bars, can be controlled.
NCSS: open the relevant graphs procedure; whichever type of chart you choose, you can control many features of the chart from the dialog box that pops open upon selection.
STATGRAPHICS: open the relevant plotting procedure; whichever type of chart you choose, you can control a number of features of the chart from the series of dialog boxes that pops open upon selection.
SYSTAT: open the relevant graph procedure (SYSTAT offers a range of other more novel graphical displays, including the dual histogram). For each choice, a dialog box opens which allows you to control almost every characteristic of the graph you want.
R Commander: open the relevant graphs procedure; for some graphs, there is minimal control offered by R Commander over the appearance of the graph (you need to use full R commands to control more aspects; e.g. see Chang, in the Useful additional readings).

Procedure 5.3: Multivariate Graphs & Displays

Graphical methods for displaying multivariate data (i.e. many variables at once) include scatterplot matrices, radar (or spider) plots, multiplots, parallel coordinate displays, and icon plots. Multivariate graphs are useful for visualising broad trends and patterns across many variables (Cleveland 1995 ; Jacoby 1998 ). Such graphs typically sacrifice precision in representation in favour of a snapshot pictorial summary that can help you form general impressions of data patterns.

It is important to note that what is presented here is a small but reasonably representative sampling of the types of graphs one can produce to summarise and display trends in multivariate data. Generally speaking, SYSTAT offers the best facilities for producing multivariate graphs, followed by STATGRAPHICS, but with the drawback that it is somewhat tricky to get the graphs in exactly the form you want. SYSTAT also has excellent facilities for creating new forms and combinations of graphs – essentially allowing graphs to be tailor-made for a specific communication purpose. Both SPSS and NCSS offer a more limited range of multivariate graphs, generally restricted to scatterplot matrices and variations of multiplots. Microsoft Excel or STATGRAPHICS are the packages to use if radar or spider plots are desired.

Scatterplot Matrices

A scatterplot matrix is a useful multivariate graph designed to show relationships between pairs of many variables in the same display.

Figure 5.10 illustrates a scatterplot matrix, produced using SYSTAT, for the mentabil, accuracy, speed, jobsat and workcond variables in the QCI database. It is easy to see that all the scatterplot matrix does is stack all pairs of scatterplots into a format where it is easy to pick out the graph for any ‘row’ variable that intersects a ‘column’ variable.

Fig. 5.10 Scatterplot matrix relating mentabil, accuracy, speed, jobsat & workcond

In those plots where a ‘row’ variable intersects itself in a column of the matrix (along the so-called ‘diagonal’), SYSTAT permits a range of univariate displays to be shown. Figure 5.10 shows univariate histograms for each variable (recall Procedure 5.2). One obvious drawback of the scatterplot matrix is that, if many variables are to be displayed (say, ten or more), the graph gets very crowded and becomes very hard to visually appreciate.
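
pandas offers a convenient one-line scatterplot matrix with histograms on the diagonal, much like Fig. 5.10; the five variables below are randomly generated stand-ins for the QCI measures.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
df = pd.DataFrame({'mentabil': rng.normal(110, 10, 112),    # all columns are hypothetical
                   'accuracy': rng.normal(82, 9, 112),
                   'speed': rng.lognormal(1.3, 0.5, 112),
                   'jobsat': rng.integers(1, 8, 112),
                   'workcond': rng.integers(1, 8, 112)})

pd.plotting.scatter_matrix(df, diagonal='hist', figsize=(8, 8))   # histograms on the diagonal
plt.show()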

Looking at the first column of graphs in Fig. 5.10 , we can see the scatterplot relationships between mentabil and each of the other variables. We can get a visual impression that mentabil seems to be slightly negatively related to accuracy (the cloud of scatter points tends to angle downward to the right, suggesting, very slightly, that higher mentabil scores are associated with lower levels of accuracy ).

Conversely, the visual impression of the relationship between mentabil and speed is that the relationship is slightly positive (higher mentabil scores tend to be associated with higher speed scores = longer inspection times). Similar types of visual impressions can be formed for other parts of Fig. 5.10 . Notice that the histogram plots along the diagonal give a clear impression of the shape of the distribution for each variable.

Radar Plots

The radar plot (also known as a spider graph, for obvious reasons) is a simple and effective device for displaying scores on many variables. Microsoft Excel offers a range of options and capabilities for producing radar plots, such as the plot shown in Fig. 5.11. Radar plots are generally easy to interpret and provide a good visual basis for comparing plots from different individuals or groups, even if a fairly large number of variables (say, up to about 25) are being displayed.

Like a clock face, variables are evenly spaced around the centre of the plot in clockwise order starting at the 12 o’clock position. Visual interpretation of a radar plot primarily relies on shape comparisons, i.e. the rise and fall of peaks and valleys along the spokes around the plot. Valleys near the centre display low scores on specific variables; peaks near the outside of the plot display high scores on specific variables. [Note that, technically, radar plots employ polar coordinates.] SYSTAT can draw graphs using polar coordinates, but not as easily as Excel can, from the user’s perspective.

Radar plots work best if all the variables represented are measured on the same scale (e.g. a 1 to 7 Likert-type scale or a 0% to 100% scale). Individuals who are missing any scores on the variables being plotted are typically omitted.
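
Because radar plots are just line plots in polar coordinates, a rough matplotlib sketch is possible; the nine scale names and the two sets of ratings below are entirely made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

scales = [f'att{i}' for i in range(1, 10)]   # nine hypothetical attitude scales
insp_a = [7, 6, 3, 7, 6, 7, 6, 3, 6]         # made-up ratings for one inspector
insp_b = [1, 2, 2, 1, 2, 1, 2, 2, 1]         # made-up ratings for another inspector

# Clock-face layout: first variable at 12 o'clock, proceeding clockwise
angles = np.pi / 2 - np.linspace(0, 2 * np.pi, len(scales), endpoint=False)
angles = np.concatenate([angles, angles[:1]])   # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
for ratings, label in [(insp_a, 'Inspector A'), (insp_b, 'Inspector B')]:
    ax.plot(angles, ratings + ratings[:1], label=label)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(scales)
ax.set_ylim(0, 7)
ax.legend()
plt.show()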

Fig. 5.11 Radar plot comparing attitude ratings for inspectors 66 and 104

The radar plot in Fig. 5.11 , produced using Excel, compares two specific inspectors, 66 and 104, on the nine attitude rating scales. Inspector 66 gave the highest rating (= 7) on the cultqual variable and inspector 104 gave the lowest rating (= 1). The plot shows that inspector 104 tended to provide very low ratings on all nine attitude variables, whereas inspector 66 tended to give very high ratings on all variables except acctrain and trainapp , where the scores were similar to those for inspector 104. Thus, in general, inspector 66 tended to show much more positive attitudes toward their workplace compared to inspector 104.

While Fig. 5.11 was generated to compare the scores for two individuals in the QCI database, it would be just as easy to produce a radar plot that compared the five types of companies in terms of their average ratings on the nine variables, as shown in Fig. 5.12 .

Fig. 5.12 Radar plot comparing average attitude ratings for five types of company

Here we can form the visual impression that the five types of companies differ most in their average ratings of mgmtcomm and least in the average ratings of polsatis . Overall, the average ratings from inspectors from PC manufacturers (black diamonds with solid lines) seem to be generally the most positive as their scores lie on or near the outer ring of scores and those from Automobile manufacturers tend to be least positive on many variables (except the training-related variables).

Extrapolating from Fig. 5.12 , you may rightly conclude that including too many groups and/or too many variables in a radar plot comparison can lead to so much clutter that any visual comparison would be severely degraded. You may have to experiment with using colour-coded lines to represent different groups versus line and marker shape variations (as used in Fig. 5.12 ), because choice of coding method for groups can influence the interpretability of a radar plot.

Multiplots

A multiplot is simply a hybrid style of graph that can display group comparisons across a number of variables. There are a wide variety of possible multiplots one could potentially design (SYSTAT offers great capabilities with respect to multiplots). Figure 5.13 shows a multiplot comprising a side-by-side series of profile-based line graphs – one graph for each type of company in the QCI database.

Fig. 5.13 Multiplot comparing profiles of average attitude ratings for five company types

The multiplot in Fig. 5.13 , produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within a specific type of company. This multiplot shows the same story as the radar plot in Fig. 5.12 , but in a different graphical format. It is still fairly clear that the average ratings from inspectors from PC manufacturers tend to be higher than for the other types of companies and the profile for inspectors from automobile manufacturers tends to be lower than for the other types of companies.

The profile for inspectors from large electrical appliance manufacturers is the flattest, meaning that their average attitude ratings were less variable than for other types of companies. Comparing the ease with which you can glean the visual impressions from Figs. 5.12 and 5.13 may lead you to prefer one style of graph over another. If you have such preferences, chances are others will also, which may mean you need to carefully consider your options when deciding how best to display data for effect.

Frequently, choice of graph is less a matter of which style is right or wrong, but more a matter of which style will suit specific purposes or convey a specific story, i.e. the choice is often strategic.

Parallel Coordinate Displays

A parallel coordinate display is useful for displaying individual scores on a range of variables, all measured using the same scale. Furthermore, such graphs can be combined side-by-side to facilitate very broad visual comparisons among groups, while retaining individual profile variability in scores. Each line in a parallel coordinate display represents one individual, e.g. an inspector.

The interpretation of a parallel coordinate display, such as the two shown in Fig. 5.14 , depends on visual impressions of the peaks and valleys (highs and lows) in the profiles as well as on the density of similar profile lines. The graph is called ‘parallel coordinate’ simply because it assumes that all variables are measured on the same scale and that scores for each variable can therefore be located along vertical axes that are parallel to each other (imagine vertical lines on Fig. 5.14 running from bottom to top for each variable on the X-axis). The main drawback of this method of data display is that only those individuals in the sample who provided legitimate scores on all of the variables being plotted (i.e. who have no missing scores) can be displayed.
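
pandas also provides a ready-made parallel coordinates function; in this sketch the ratings and the two company labels are fabricated, and only complete cases should be passed in, matching the limitation noted above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical complete-case 1-to-7 ratings on five attitude scales for two company types
df = pd.DataFrame(rng.integers(1, 8, size=(30, 5)),
                  columns=[f'att{i}' for i in range(1, 6)])
df['company'] = np.repeat(['PC', 'Automobile'], 15)

# Each line is one inspector; the variables sit on parallel vertical axes
pd.plotting.parallel_coordinates(df, class_column='company', alpha=0.5)
plt.show()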

Fig. 5.14 Parallel coordinate displays comparing profiles of average attitude ratings for five company types

The parallel coordinate display in Fig. 5.14 , produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within two specific types of company: the left graph for inspectors from PC manufacturers and the right graph for automobile manufacturers.

There are fewer lines in each display than the number of inspectors from each type of company simply because several inspectors from each type of company were missing a rating on at least one of the nine attitude variables. The graphs show great variability in scores amongst inspectors within a company type, but there are some overall patterns evident.

For example, inspectors from automobile companies clearly and fairly uniformly rated mgmtcomm toward the low end of the scale, whereas the reverse was generally true for that variable for inspectors from PC manufacturers. Conversely, inspectors from automobile companies tend to rate acctrain and trainapp more toward the middle to high end of the scale, whereas the reverse is generally true for those variables for inspectors from PC manufacturers.

Icon Plots

Perhaps the most creative types of multivariate displays are the so-called icon plots. SYSTAT and STATGRAPHICS offer an impressive array of different types of icon plots, including, amongst others, Chernoff’s faces, profile plots, histogram plots, star glyphs and sunray plots (Jacoby 1998 provides a detailed discussion of icon plots).

Icon plots generally use a specific visual construction to represent variable scores obtained by each individual within a sample or group. All icon plots are thus methods for displaying the response patterns for individual members of a sample, as long as those individuals are not missing any scores on the variables to be displayed (note that this is the same limitation as for radar plots and parallel coordinate displays). To illustrate icon plots, without generating too many icons to focus on, Figs. 5.15, 5.16, 5.17 and 5.18 present four different icon plots for QCI inspectors classified, using a new variable called BEST_WORST, as either the worst performers (= 1, where their accuracy scores were less than 70%) or the best performers (= 2, where their accuracy scores were 90% or greater).

Fig. 5.15 Chernoff’s faces icon plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.16 Profile plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.17 Histogram plot comparing individual attitude ratings for best and worst performing inspectors

Fig. 5.18 Sunray plot comparing individual attitude ratings for best and worst performing inspectors

The Chernoff’s faces plot gets its name from the visual icon used to represent variable scores – a cartoon-type face. This icon tries to capitalise on our natural human ability to recognise and differentiate faces. Each feature of the face is controlled by the scores on a single variable. In SYSTAT, up to 20 facial features are controllable; the first five being curvature of mouth, angle of brow, width of nose, length of nose and length of mouth (SYSTAT Software Inc., 2009 , p. 259). The theory behind Chernoff’s faces is that similar patterns of variable scores will produce similar looking faces, thereby making similarities and differences between individuals more apparent.

The profile plot and histogram plot are actually two variants of the same type of icon plot. A profile plot represents individuals’ scores for a set of variables using simplified line graphs, one per individual. The profile is scaled so that the vertical height of the peaks and valleys correspond to actual values for variables where the variables anchor the X-axis in a fashion similar to the parallel coordinate display. So, as you examine a profile from left to right across the X-axis of each graph, you are looking across the set of variables. A histogram plot represents the same information in the same way as for the profile plot but using histogram bars instead.

Figure 5.15 , produced using SYSTAT, shows a Chernoff’s faces plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine general attitude statements.

Each face is labelled with the inspector number it represents. The gaps indicate where an inspector had missing data on at least one of the variables, meaning a face could not be generated for them. The worst performers are drawn using red lines; the best using blue lines. The first variable is jobsat and this variable controls mouth curvature; the second variable is workcond and this controls angle of brow, and so on. It seems clear that there are differences in the faces between the best and worst performers with, for example, best performers tending to be more satisfied (smiling) and with higher ratings for working conditions (brow angle).

Beyond a broad visual impression, there is little in terms of precise inferences you can draw from a Chernoff’s faces plot. It really provides a visual sketch, nothing more. The fact that there is no obvious link between facial features, variables and score levels means that the Chernoff’s faces icon plot is difficult to interpret at the level of individual variables – a holistic impression of similarity and difference is what this type of plot facilitates.

Figure 5.16, produced using SYSTAT, shows a profile plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine attitude variables.

Like the Chernoff’s faces plot (Fig. 5.15), as you read across the rows of the plot from left to right, each plot corresponds to an inspector in the sample who was either in the worst performer (red) or best performer (blue) category. The first attitude variable is jobsat and anchors the left end of each line graph; the last variable is polsatis and anchors the right end of the line graph. The remaining variables are represented in order from left to right across the X-axis of each graph. Figure 5.16 shows that these inspectors are rather different in their attitude profiles, with best performers tending to show taller profiles on the first two variables, for example.

Figure 5.17, produced using SYSTAT, shows a histogram plot for the best and worst performing inspectors based on their ratings of job satisfaction, working conditions and the nine attitude variables. This plot tells the same story as the profile plot, only using histogram bars. Some people would prefer the histogram icon plot to the profile plot because each histogram bar corresponds to one variable, making the visual linking of a specific bar to a specific variable much easier than visually linking a specific position along the profile line to a specific variable.

The sunray plot is actually a simplified adaptation of the radar plot (called a “star glyph”) used to represent scores on a set of variables for each individual within a sample or group. Remember that a radar plot basically arranges the variables around a central point like a clock face; the first variable is represented at the 12 o’clock position and the remaining variables follow around the plot in a clockwise direction.

Unlike a radar plot, while the spokes (the actual ‘star’ of the glyph’s name) of the plot are visible, no interpretive scale is evident. A variable’s score is visually represented by its distance from the central point. Thus, the star glyphs in a sunray plot are designed, like Chernoff’s faces, to provide a general visual impression based on icon shape. A wide-diameter, well-rounded plot indicates an individual with high scores on all variables; a small-diameter, well-rounded plot indicates an individual with low scores on all variables. Jagged plots represent individuals with highly variable scores across the variables. ‘Stars’ of similar size, shape and orientation represent similar individuals.

Figure 5.18 , produced using STATGRAPHICS, shows a sunray plot for the best and worst performing inspectors. An interpretation glyph is also shown in the lower right corner of Fig. 5.18 , where variables are aligned with the spokes of a star (e.g. jobsat is at the 12 o’clock position). This sunray plot could lead you to form the visual impression that the worst performing inspectors (group 1) have rather less rounded rating profiles than do the best performing inspectors (group 2) and that the jobsat and workcond spokes are generally lower for the worst performing inspectors.

Comparatively speaking, the sunray plot makes identifying similar individuals a bit easier (perhaps even easier than Chernoff’s faces) and, when ordered as STATGRAPHICS showed in Fig. 5.18 , permits easier visual comparisons between groups of individuals, but at the expense of precise knowledge about variable scores. Remember, a holistic impression is the goal pursued using a sunray plot.

Advantages

Multivariate graphical methods provide summary techniques for visually presenting certain characteristics of a complex array of data on variables. Such visual representations are generally better at helping us to form holistic impressions of multivariate data rather than any sort of tabular representation or numerical index. They also allow us to compress many numerical measures into a finite representation that is generally easy to understand. Multivariate graphical displays can add interest to an otherwise dry statistical reporting of numerical data. They are designed to appeal to our pattern recognition skills, focusing our attention on features of the data such as shape, level, variability and orientation. Some multivariate graphs (e.g. radar plots, sunray plots and multiplots) are useful not only for representing score patterns for individuals but also providing summaries of score patterns across groups of individuals.

Disadvantages

Multivariate graphs tend to get very busy-looking and are hard to interpret if a great many variables or a large number of individuals need to be displayed (imagine any of the icon plots, for a sample of 200 questionnaire participants, displayed on an A4 page – each icon would be so small that its features could not be easily distinguished, thereby defeating the purpose of the display). In such cases, using numerical summary statistics (such as averages or correlations) in tabular form alone will provide a more economical and efficient summary. Also, some multivariate displays will work better for conveying certain types of information than others.

Information about variable relationships may be better displayed using a scatterplot matrix. Information about individual similarities and difference on a set of variables may be better conveyed using a histogram or sunray plot. Multiplots may be better suited to displaying information about group differences across a set of variables. Information about the overall similarity of individual entities in a sample might best be displayed using Chernoff’s faces.

Because people differ greatly in their visual capacities and preferences, certain types of multivariate displays will work for some people and not others. Sometimes, people will not see what you see in the plots. Some plots, such as Chernoff’s faces, may not strike a reader as a serious statistical procedure and this could adversely influence how convinced they will be by the story the plot conveys. None of the multivariate displays described here provide sufficiently precise information for solid inferences or interpretations; all are designed to simply facilitate the formation of holistic visual impressions. In fact, you may have noticed that some displays (scatterplot matrices and the icon plots, for example) provide no numerical scaling information that would help make precise interpretations. If precision in summary information is desired, the types of multivariate displays discussed here would not be the best strategic choices.

Where Is This Procedure Useful?

Virtually any research design which produces quantitative data/statistics for multiple variables provides opportunities for multivariate graphical data display which may help to clarify or illustrate important data characteristics or relationships. Thus, for survey research involving many identically-scaled attitudinal questions, a multivariate display may be just the device needed to communicate something about patterns in the data. Multivariate graphical displays are simply specialised communication tools designed to compress a lot of information into a meaningful and efficient format for interpretation—which tool to choose depends upon the message to be conveyed.

Generally speaking, visual representations of multivariate data could prove more useful in communicating to lay persons who are unfamiliar with statistics or who prefer visual as opposed to numerical information. However, these displays would probably require some interpretive discussion so that the reader clearly understands their intent.

Software Procedures

SPSS: Graphs → Chart Builder… and choose from the gallery; drag the chart type into the working area and customise the chart with desired variables, labels, etc. Only a few elements of each chart can be configured and altered.
NCSS: open the scatterplot matrix procedure; only a few elements of this plot are customisable in NCSS.
SYSTAT: open the scatterplot matrix procedure (you can select what type of plot you want to appear in the diagonal boxes) or the multiplot, parallel coordinate display or icon plot procedures (for icon plots, you can choose from a range of icons including Chernoff’s faces, histogram, star, sun or profile amongst others). A large number of elements of each type of plot are easily customisable, although it may take some trial and error to get exactly the look you want.
STATGRAPHICS: open the relevant multivariate plotting procedure; several elements of each type of plot are easily customisable, although it may take some trial and error to get exactly the look you want.
R Commander: offers the scatterplot matrix; you can select what type of plot you want to appear in the diagonal boxes, and you can control some other features of the plot. Other multivariate data displays are available via various R packages, but not through R Commander.

Procedure 5.4: Assessing Central Tendency

The three most commonly reported measures of central tendency are the mean, median and mode. Each measure reflects a specific way of defining central tendency in a distribution of scores on a variable and each has its own advantages and disadvantages.

The mean is the most widely used measure of central tendency (also called the arithmetic average). Very simply, a mean is the sum of all the scores for a specific variable in a sample divided by the number of scores used in obtaining the sum. The resulting number reflects the average score for the sample of individuals on which the scores were obtained. If one were asked to predict the score that any single individual in the sample would obtain, the best prediction, in the absence of any other relevant information, would be the sample mean. Many parametric statistical methods (such as several of the procedures discussed in Chap. 7) deal with sample means in one way or another. For any sample of data, there is one and only one possible value for the mean in a specific distribution. For most purposes, the mean is the preferred measure of central tendency because it utilises all the available information in a sample.

In the context of the QCI database, Maree could quite reasonably ask what inspectors scored on the average in terms of mental ability ( mentabil ), inspection accuracy ( accuracy ), inspection speed ( speed ), overall job satisfaction ( jobsat ), and perceived quality of their working conditions ( workcond ). Table 5.3 shows the mean scores for the sample of 112 quality control inspectors on each of these variables. The statistics shown in Table 5.3 were computed using the SPSS Frequencies ... procedure. Notice that the table indicates how many of the 112 inspectors had a valid score for each variable and how many were missing a score (e.g. 109 inspectors provided a valid rating for jobsat; 3 inspectors did not).

Table 5.3 Measures of central tendency for specific QCI variables

Each mean needs to be interpreted in terms of the original units of measurement for each variable. Thus, the inspectors in the sample showed an average mental ability score of 109.84 (higher than the general population mean of 100 for the test), an average inspection accuracy of 82.14%, and an average speed for making quality control decisions of 4.48 s. Furthermore, in terms of their work context, inspectors reported an average overall job satisfaction of 4.96 (on the 7-point scale, nearly one full scale point above the Neutral point of 4, indicating a generally positive but not strong level of job satisfaction) and an average perceived quality of work conditions of 4.21 (on the 7-point scale, just about at the level of Stressful but Tolerable).

The mean is sensitive to the presence of extreme values, which can distort its value, giving a biased indication of central tendency. As we will see below, the median is an alternative statistic to use in such circumstances. However, it is also possible to compute what is called a trimmed mean where the mean is calculated after a certain percentage (say, 5% or 10%) of the lowest and highest scores in a distribution have been ignored (a process called ‘trimming’; see, for example, the discussion in Field 2018 , pp. 262–264). This yields a statistic less influenced by extreme scores. The drawbacks are that the decision as to what percentage to trim can be somewhat subjective and trimming necessarily sacrifices information (i.e. the extreme scores) in order to achieve a less biased measure. Some software packages, such as SPSS, SYSTAT or NCSS, can report a specific percentage trimmed mean, if that option is selected for descriptive statistics or exploratory data analysis (see Procedure 5.6 ) procedures. Comparing the original mean with a trimmed mean can provide an indication of the degree to which the original mean has been biased by extreme values.
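
To make the trimming idea concrete, here is a minimal Python sketch (an illustration only, with made-up scores; the packages named above do this via menu options). It uses scipy.stats.trim_mean, which drops the stated proportion of scores from each tail before averaging:

```python
# Minimal sketch: ordinary mean vs a 10% trimmed mean (hypothetical data).
import numpy as np
from scipy import stats

scores = np.array([2.1, 2.5, 3.0, 3.2, 3.5, 3.8, 4.0, 4.2, 4.5, 17.1])  # one extreme value

print(np.mean(scores))               # ordinary mean, pulled upward by the 17.1
print(stats.trim_mean(scores, 0.1))  # mean after trimming 10% from each tail
```

Comparing the two outputs shows how strongly a single extreme score can bias the ordinary mean.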

Median

Very simply, the median is the centre or middle score of a set of scores. By ‘centre’ or ‘middle’ is meant that 50% of the data values are smaller than or equal to the median and 50% of the data values are larger, when the entire distribution of scores is rank ordered from the lowest to highest value. Thus, we can say that the median is the score in the sample which occurs at the 50th percentile. [Note that a ‘percentile’ is attached to a specific score that a specific percentage of the sample scored at or below. Thus, a score at the 25th percentile means that 25% of the sample achieved this score or a lower score.] Table 5.3 shows the 25th, 50th and 75th percentile scores for each variable – note how the 50th percentile score is exactly equal to the median in each case.
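
As a quick numerical sketch (hypothetical values, not the QCI data), the median and other percentiles can be computed directly, and the 50th percentile coincides with the median:

```python
# Median and quartiles for a small set of hypothetical scores.
import numpy as np

scores = np.array([57, 62, 71, 75, 78, 82, 83, 85, 88, 94, 100])

print(np.median(scores))                    # 82.0 - the middle score of the 11 values
print(np.percentile(scores, [25, 50, 75]))  # note the 50th percentile equals the median
```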

The median is reported somewhat less frequently than the mean but does have some advantages over the mean in certain circumstances. One such circumstance is when the sample of data has a few extreme values in one direction (either very large or very small relative to all other scores). In this case, the mean would be influenced (biased) to a much greater degree than would the median, since all of the data are used to calculate the mean (including the extreme scores) whereas only the single centre score is needed for the median. For this reason, many nonparametric statistical procedures (such as several of the nonparametric procedures described in Chap. 7) focus on the median as the comparison statistic rather than on the mean.

A discrepancy between the values for the mean and median of a variable provides some insight to the degree to which the mean is being influenced by the presence of extreme data values. In a distribution where there are no extreme values on either side of the distribution (or where extreme values balance each other out on either side of the distribution, as happens in a normal distribution – see Fundamental Concept II ), the mean and the median will coincide at the same value and the mean will not be biased.

For highly skewed distributions, however, the value of the mean will be pulled toward the long tail of the distribution because that is where the extreme values lie. However, in such skewed distributions, the median will be insensitive (statisticians call this property ‘robustness’) to extreme values in the long tail. For this reason, the direction of the discrepancy between the mean and median can give a very rough indication of the direction of skew in a distribution (‘mean larger than median’ signals possible positive skewness; ‘mean smaller than median’ signals possible negative skewness). Like the mean, there is one and only one possible value for the median in a specific distribution.

In Fig. 5.19 , the left graph shows the distribution of speed scores and the right-hand graph shows the distribution of accuracy scores. The speed distribution clearly shows the mean being pulled toward the right tail of the distribution whereas the accuracy distribution shows the mean being just slightly pulled toward the left tail. The effect on the mean is stronger in the speed distribution indicating a greater biasing effect due to some very long inspection decision times.

Fig. 5.19 Effects of skewness in a distribution on the values for the mean and median

If we refer to Table 5.3 , we can see that the median score for each of the five variables has also been computed. Like the mean, the median must be interpreted in the original units of measurement for the variable. We can see that for mentabil , accuracy , and workcond , the value of the median is very close to the value of the mean, suggesting that these distributions are not strongly influenced by extreme data values in either the high or low direction. However, note that the median speed was 3.89 s compared to the mean of 4.48 s, suggesting that the distribution of speed scores is positively skewed (the mean is larger than the median—refer to Fig. 5.19 ). Conversely, the median jobsat score was 5.00 whereas the mean score was 4.96 suggesting very little substantive skewness in the distribution (mean and median are nearly equal).

Mode

The mode is the simplest measure of central tendency. It is defined as the most frequently occurring score in a distribution. Put another way, it is the score that more individuals in the sample obtain than any other score. An interesting problem associated with the mode is that there may be more than one in a specific distribution. In the case where multiple modes exist, the issue becomes which value do you report? The answer is that you must report all of them. In a ‘normal’ bell-shaped distribution, there is only one mode and it is indeed at the centre of the distribution, coinciding with both the mean and the median.

Table 5.3 also shows the mode for each of the five variables. For example, inspectors achieved a mentabil score of 111 more often than any other score, and they reported a jobsat rating of 6 more often than any other rating. SPSS only ever reports one mode even if several are present, so one must be careful and look at a histogram plot for each variable to make a final determination of the mode(s) for that variable.
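
Because of the multiple-modes issue just described, it helps to use a routine that reports every mode rather than just the first one found; a minimal Python sketch with made-up ratings:

```python
# multimode returns all modes, not just the first one encountered.
from statistics import multimode

ratings = [4, 5, 6, 6, 6, 7, 3, 2, 5, 5, 1]  # hypothetical 7-point ratings

print(multimode(ratings))  # [5, 6] - both values occur three times, so both are modes
```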

All three measures of central tendency yield information about what is going on in the centre of a distribution of scores. The mean and median provide a single number which can summarise the central tendency in the entire distribution. The mode can yield one or multiple indices. With many measurements on individuals in a sample, it is advantageous to have single number indices which can describe the distributions in summary fashion. In a normal or near-normal distribution of sample data, the mean, the median, and the mode will all generally coincide at the one point. In this instance, all three statistics will provide approximately the same indication of central tendency. Note however that it is seldom the case that all three statistics would yield exactly the same number for any particular distribution. The mean is the most useful statistic, unless the data distribution is skewed by extreme scores, in which case the median should be reported.

While measures of central tendency are useful descriptors of distributions, summarising data using a single numerical index necessarily reduces the amount of information available about the sample. Not only do we need to know what is going on in the centre of a distribution, we also need to know what is going on around the centre of the distribution. For this reason, most social and behavioural researchers report not only measures of central tendency, but also measures of variability (see Procedure 5.5 ). The mode is the least informative of the three statistics because of its potential for producing multiple values.

Measures of central tendency are useful in almost any type of experimental design, survey or interview study, and in any observational study where quantitative data are available and must be summarised. The decision as to whether the mean or median should be reported depends upon the nature of the data, which should ideally be ascertained by visual inspection of the data distribution. Some researchers opt to report both measures routinely. Computation of means is a prelude to many parametric statistical methods (see, for example, several of the procedures in Chap. 7); comparison of medians is associated with many nonparametric statistical methods (see, for example, several of the nonparametric procedures in Chap. 7).

Application Procedures

SPSS: from the statistics options, choose mean, median and mode. To see trimmed means, you must use the Exploratory Data Analysis procedure (see Procedure 5.6).
NCSS: select the reports and plots that you want to see; make sure you indicate that you want to see the ‘Means Section’ of the Report. If you want to see trimmed means, tick the ‘Trimmed Section’ of the Report.
SYSTAT: select the mean, median and mode (as well as any other statistics you might wish to see). If you want to see trimmed means, tick the ‘Trimmed mean’ section of the dialog box and set the percentage to trim in the box labelled ‘Two-sided’.
STATGRAPHICS: choose the variable(s) you want to describe and select Summary Statistics (you don’t get any options for statistics to report – measures of central tendency and variability are produced automatically). STATGRAPHICS will not report modes, and to see the median you will need to request ‘Percentiles’; the 50th percentile score is the median, although it won’t be labelled as such.
R Commander: select the central tendency statistics you want to see. R Commander will not produce modes; to see the median, make sure the ‘Quantiles’ box is ticked – the .5 quantile (50th percentile) score is the median, although it won’t be labelled as such.

Procedure 5.5: Assessing Variability

There are a variety of measures of variability to choose from, including the range, interquartile range, variance and standard deviation. Each measure reflects a specific way of defining variability in a distribution of scores on a variable and each has its own advantages and disadvantages. Most measures of variability are associated with a specific measure of central tendency, so researchers are now commonly expected to report both a measure of central tendency and its associated measure of variability whenever they display numerical descriptive statistics on continuous or rank-ordered variables.

The Range

The range is the simplest measure of variability for a sample of data scores: it is merely the largest score in the sample minus the smallest score in the sample. The range is the one measure of variability not explicitly associated with any measure of central tendency. It gives a very rough indication of the extent of spread in the scores. However, since the range uses only two of the total available scores in the sample, the rest of the scores are ignored, which means that a lot of potentially useful information is sacrificed. There are also problems if either the highest or lowest (or both) scores are atypical or too extreme in value (as in highly skewed distributions). When this happens, the range gives a very inflated picture of the typical variability in the scores. Thus, the range tends not to be a frequently reported measure of variability.

Table 5.4 shows a set of descriptive statistics, produced by the SPSS Frequencies procedure, for the mentabil, accuracy, speed, jobsat and workcond measures in the QCI database. In the table, you will find three rows labelled ‘Range’, ‘Minimum’ and ‘Maximum’.

Table 5.4 Measures of central tendency and variability for specific QCI variables

Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s – the fastest quality decision – to 17.10 s – the slowest quality decision). Accuracy scores had a range of 43 (from 57% – the least accurate inspector – to 100% – the most accurate inspector). Both work context measures ( jobsat and workcond ) exhibited a range of 6 – the largest possible range given the 1 to 7 scale of measurement for these two variables.

Interquartile Range

The Interquartile Range ( IQR ) is a measure of variability that is specifically designed to be used in conjunction with the median. The IQR also takes care of the extreme data problem which typically plagues the range measure. The IQR is defined as the range that is covered by the middle 50% of scores in a distribution once the scores have been ranked in order from lowest value to highest value. It is found by locating the value in the distribution at or below which 25% of the sample scored and subtracting this number from the value in the distribution at or below which 75% of the sample scored. The IQR can also be thought of as the range one would compute after the bottom 25% of scores and the top 25% of scores in the distribution have been ‘chopped off’ (or ‘trimmed’ as statisticians call it).

The IQR gives a much more stable picture of the variability of scores and, like the median, is relatively insensitive to the biasing effects of extreme data values. Some behavioural researchers prefer to divide the IQR in half which gives a measure called the Semi-Interquartile Range ( S-IQR ) . The S-IQR can be interpreted as the distance one must travel away from the median, in either direction, to reach the value which separates the top (or bottom) 25% of scores in the distribution from the remaining 75%.

The IQR or S-IQR is typically not produced by descriptive statistics procedures by default in many computer software packages; however, it can usually be requested as an optional statistic to report or it can easily be computed by hand using percentile scores. Both the median and the IQR figure prominently in Exploratory Data Analysis, particularly in the production of boxplots (see Procedure 5.6 ).

Figure 5.20 illustrates the conceptual nature of the IQR and S-IQR compared to that of the range. Assume that 100% of data values are covered by the distribution curve in the figure. It is clear that these three measures would provide very different values for a measure of variability. Your choice would depend on your purpose. If you simply want to signal the overall span of scores between the minimum and maximum, the range is the measure of choice. But if you want to signal the variability around the median, the IQR or S-IQR would be the measure of choice.

Fig. 5.20 How the range, IQR and S-IQR measures of variability conceptually differ

Note: Some behavioural researchers refer to the IQR as the hinge-spread (or H-spread ) because of its use in the production of boxplots:

  • the 25th percentile data value is referred to as the ‘lower hinge’;
  • the 75th percentile data value is referred to as the ‘upper hinge’; and
  • their difference gives the H-spread.

Midspread is another term you may see used as a synonym for interquartile range.

Referring back to Table 5.4 , we can find statistics reported for the median and for the ‘quartiles’ (25th, 50th and 75th percentile scores) for each of the five variables of interest. The ‘quartile’ values are useful for finding the IQR or S-IQR because SPSS does not report these measures directly. The median clearly equals the 50th percentile data value in the table.

If we focus, for example, on the speed variable, we could find its IQR by subtracting the 25th percentile score of 2.19 s from the 75th percentile score of 5.71 s to give a value for the IQR of 3.52 s (the S-IQR would simply be 3.52 divided by 2 or 1.76 s). Thus, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores spanning a range of 3.52 s. Alternatively, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores which ranged 1.76 s either side of the median value.
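
The arithmetic for this example is simple enough to verify directly (a sketch using the percentile values quoted above):

```python
# IQR and S-IQR for the speed variable, using the percentiles reported above.
q1, q3 = 2.19, 5.71   # 25th and 75th percentile speed scores, in seconds

iqr = q3 - q1         # interquartile range: span of the middle 50% of scores
s_iqr = iqr / 2       # semi-interquartile range: distance either side of the median

print(f"IQR = {iqr:.2f} s, S-IQR = {s_iqr:.2f} s")  # IQR = 3.52 s, S-IQR = 1.76 s
```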

Note: We could compare the ‘Minimum’ or ‘Maximum’ scores to the 25th percentile score and 75th percentile score respectively to get a feeling for whether the minimum or maximum might be considered extreme or uncharacteristic data values.

Variance

The variance uses information from every individual in the sample to assess the variability of scores relative to the sample mean. Variance assesses the average squared deviation of each score from the mean of the sample. Deviation refers to the difference between an observed score value and the mean of the sample; deviations are squared simply because adding them up in their naturally occurring unsquared form (where some differences are positive and others are negative) always gives a total of zero, which is useless for an index purporting to measure something.

If many scores are quite different from the mean, we would expect the variance to be large. If all the scores lie fairly close to the sample mean, we would expect a small variance. If all scores exactly equal the mean (i.e. all the scores in the sample have the same value), then we would expect the variance to be zero.
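
A small numerical sketch (hypothetical scores) shows why the squaring step is necessary: raw deviations around the mean cancel to zero, whereas squared deviations yield a usable index:

```python
# Why deviations must be squared before averaging.
import numpy as np

scores = np.array([96.0, 98.0, 100.0, 102.0, 104.0])
deviations = scores - scores.mean()

print(deviations.sum())        # 0.0 - unsquared deviations always cancel out
print(np.var(scores, ddof=1))  # 10.0 - sample variance (squared deviations, divided by n - 1)
```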

Figure 5.21 illustrates some possibilities regarding the variance of a distribution of scores having a mean of 100. The very tall curve illustrates a distribution with small variance, the distribution of medium height illustrates a distribution with medium variance, and the flattest distribution illustrates a distribution with large variance.

Fig. 5.21 The concept of variance

If we had a distribution with no variance, the curve would simply be a vertical line at a score of 100 (meaning that all scores were equal to the mean). You can see that as variance increases, the tails of the distribution extend further outward and the concentration of scores around the mean decreases. You may have noticed that variance and range (as well as the IQR) will be related, since the range focuses on the difference between the ends of the two tails in the distribution and larger variances extend the tails. So, a larger variance will generally be associated with a larger range and IQR compared to a smaller variance.

It is generally difficult to descriptively interpret the variance measure in a meaningful fashion since it involves squared deviations around the sample mean. [Note: If you look back at Table 5.4, you will see the variance listed for each of the variables (e.g. the variance of accuracy scores is 84.118), but the numbers themselves make little sense and do not relate to the original measurement scale for the variables (which, for the accuracy variable, went from 0% to 100% accuracy).] Instead, we use the variance as a stepping stone for obtaining a measure of variability that we can clearly interpret, namely the standard deviation. However, you should know that variance is an important concept in its own right simply because it provides the statistical foundation for many of the correlational procedures and statistical inference procedures described in Chaps. 6, 7 and 8.

When considering either correlations or tests of statistical hypotheses, we frequently speak of one variable explaining or sharing variance with another (see the relevant procedures in Chaps. 6 and 7). In doing so, we are invoking the concept of variance as set out here—what we are saying is that variability in the behaviour of scores on one particular variable may be associated with or predictive of variability in scores on another variable of interest (e.g. it could explain why those scores have a non-zero variance).

Standard Deviation

The standard deviation (often abbreviated as SD, sd or Std. Dev.) is the most commonly reported measure of variability because it has a meaningful interpretation and is used in conjunction with reports of sample means. Variance and standard deviation are closely related measures in that the standard deviation is found by taking the square root of the variance. The standard deviation, very simply, is a summary number that reflects the ‘average distance of each score from the mean of the sample’. In many parametric statistical methods, both the sample mean and sample standard deviation are employed in some form. Thus, the standard deviation is a very important measure, not only for data description, but also for hypothesis testing and the establishment of relationships as well.

Referring again back to Table 5.4 , we’ll focus on the results for the speed variable for discussion purposes. Table 5.4 shows that the mean inspection speed for the QCI sample was 4.48 s. We can also see that the standard deviation (in the row labelled ‘Std Deviation’) for speed was 2.89 s.

This standard deviation has a straightforward interpretation: we would say that ‘on the average, an inspector’s quality inspection decision speed differed from the mean of the sample by about 2.89 s in either direction’. In a normal distribution of scores (see Fundamental Concept II ), we would expect to see about 68% of all inspectors having decision speeds between 1.59 s (the mean minus one amount of the standard deviation) and 7.37 s (the mean plus one amount of the standard deviation).
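
The band quoted above is just the mean plus or minus one standard deviation; a trivial sketch using the values from Table 5.4:

```python
# The 'about 68%' band for speed: mean +/- 1 standard deviation.
mean_speed, sd_speed = 4.48, 2.89   # seconds, from Table 5.4

lower = mean_speed - sd_speed       # 1.59 s
upper = mean_speed + sd_speed       # 7.37 s

print(f"{lower:.2f} s to {upper:.2f} s")  # band expected to hold ~68% of scores if roughly normal
```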

We noted earlier that the range of the speed scores was 16.05 s. However, the fact that the maximum speed score was 17.1 s compared to the 75th percentile score of just 5.71 s seems to suggest that this maximum speed might be rather atypically large compared to the bulk of speed scores. This means that the range is likely to be giving us a false impression of the overall variability of the inspectors’ decision speeds.

Furthermore, given that the mean speed score was higher than the median speed score, suggesting that speed scores were positively skewed (this was confirmed by the histogram for speed shown in Fig. 5.19 in Procedure 5.4 ), we might consider emphasising the median and its associated IQR or S-IQR rather than the mean and standard deviation. Of course, similar diagnostic and interpretive work could be done for each of the other four variables in Table 5.4 .

Measures of variability (particularly the standard deviation) provide a summary measure that gives an indication of how variable (spread out) a particular sample of scores is. When used in conjunction with a relevant measure of central tendency (particularly the mean), a reasonable yet economical description of a set of data emerges. When there are extreme data values or severe skewness is present in the data, the IQR (or S-IQR) becomes the preferred measure of variability to be reported in conjunction with the sample median (or 50th percentile value). These latter measures are much more resistant (‘robust’) to influence by data anomalies than are the mean and standard deviation.

As mentioned above, the range is a very cursory index of variability; thus, it is not as useful as the variance or standard deviation. Variance has little meaningful interpretation as a descriptive index; hence, the standard deviation is most often reported. However, the standard deviation (or IQR) has little meaning if the sample mean (or median) is not reported along with it.

Knowing that the standard deviation for accuracy is 9.17 tells you little unless you know the mean accuracy (82.14) that it is the standard deviation from.

Like the sample mean, the standard deviation can be strongly biased by the presence of extreme data values or severe skewness in a distribution in which case the median and IQR (or S-IQR) become the preferred measures. The biasing effect will be most noticeable in samples which are small in size (say, less than 30 individuals) and far less noticeable in large samples (say, in excess of 200 or 300 individuals). [Note that, in a manner similar to a trimmed mean, it is possible to compute a trimmed standard deviation to reduce the biasing effect of extreme data values, see Field 2018 , p. 263.]

It is important to realise that the resistance of the median and IQR (or S-IQR) to extreme values is only gained by deliberately sacrificing a good deal of the information available in the sample (nothing is obtained without a cost in statistics). What is sacrificed is information from all other members of the sample other than those members who scored at the median and 25th and 75th percentile points on a variable of interest; information from all members of the sample would automatically be incorporated in mean and standard deviation for that variable.

Any investigation where you might report on or read about measures of central tendency on certain variables should also report measures of variability. This is particularly true for data from experiments, quasi-experiments, observational studies and questionnaires. It is important to consider measures of central tendency and measures of variability to be inextricably linked—one should never report one without the other if an adequate descriptive summary of a variable is to be communicated.

Other descriptive measures, such as those for skewness and kurtosis, may also be of interest if a more complete description of any variable is desired. Most good statistical packages can be instructed to report these additional descriptive measures as well.

Of all the statistics you are likely to encounter in the business, behavioural and social science research literature, means and standard deviations will dominate as measures for describing data. Additionally, these statistics will usually be reported when any parametric tests of statistical hypotheses are presented as the mean and standard deviation provide an appropriate basis for summarising and evaluating group differences.

Application Procedures

SPSS: from the statistics options, choose Std. Deviation, Variance, Range, Minimum and/or Maximum as appropriate. SPSS does not have an option to produce either the IQR or S-IQR; however, if you request ‘Quantiles’ you will see the 25th and 75th percentile scores, which can then be used to quickly compute either variability measure. Remember to select appropriate central tendency measures as well.
NCSS: select the reports and plots that you want to see; make sure you indicate that you want to see the Variance Section of the Report. Remember to select appropriate central tendency measures as well (by opting to see the Means Section of the Report).
SYSTAT: select SD, Variance, Range, Interquartile range, Minimum and/or Maximum as appropriate. Remember to select appropriate central tendency measures as well.
STATGRAPHICS: choose the variable(s) you want to describe and select Summary Statistics (you don’t get any options for statistics to report – measures of central tendency and variability are produced automatically). STATGRAPHICS does not produce either the IQR or S-IQR; however, ‘Percentiles’ can be requested in order to see the 25th and 75th percentile scores, which can then be used to quickly compute either variability measure.
R Commander: select either the Standard Deviation or Interquartile Range as appropriate. R Commander will not produce the range statistic or report minimum or maximum scores. Remember to select appropriate central tendency measures as well.

Fundamental Concept I: Basic Concepts in Probability

The Concept of Simple Probability

In Procedures 5.1 and 5.2 , you encountered the idea of the frequency of occurrence of specific events such as particular scores within a sample distribution. Furthermore, it is a simple operation to convert the frequency of occurrence of a specific event into a number representing the relative frequency of that event. The relative frequency of an observed event is merely the number of times the event is observed divided by the total number of times one makes an observation. The resulting number ranges between 0 and 1 but we typically re-express this number as a percentage by multiplying it by 100%.

In the QCI database, Maree Lakota observed data from 112 quality control inspectors of which 58 were male and 51 were female (gender indications were missing for three inspectors). The statistics 58 and 51 are thus the frequencies of occurrence for two specific types of research participant, a male inspector or a female inspector.

If she divided each frequency by the total number of observations (i.e. 112), she would obtain .52 for males and .46 for females (leaving .02 of observations with unknown gender). These statistics are relative frequencies which indicate the proportion of times that Maree obtained data from a male or female inspector. Multiplying each relative frequency by 100% would yield 52% and 46%, which she could interpret as indicating that 52% of her sample was male and 46% was female (leaving 2% of the sample with unknown gender).
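
In code form, the computation is just each count divided by the total number of observations (counts from the QCI sample as described above):

```python
# Relative frequencies for gender in the QCI sample.
n_total = 112
n_male, n_female = 58, 51   # gender was missing for the remaining 3 inspectors

rel_male = n_male / n_total       # about .52
rel_female = n_female / n_total   # about .46

print(f"{rel_male:.0%} male, {rel_female:.0%} female")  # 52% male, 46% female
```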

It does not take much of a leap in logic to move from the concept of ‘relative frequency’ to the concept of ‘probability’. In our discussion above, we focused on relative frequency as indicating the proportion or percentage of times a specific category of participant was obtained in a sample. The emphasis here is on data from a sample.

Imagine now that Maree had infinite resources and research time and was able to obtain ever larger samples of quality control inspectors for her study. She could still compute the relative frequencies for obtaining data from males and females in her sample but as her sample size grew larger and larger, she would notice these relative frequencies converging toward some fixed values.

If, by some miracle, Maree could observe all of the quality control inspectors on the planet today, she would have measured the entire population and her computations of relative frequency for males and females would yield two precise numbers, each indicating the proportion of the population of inspectors that was male and the proportion that was female.

If Maree were then to list all of these inspectors and randomly choose one from the list, the chances that she would choose a male inspector would be equal to the proportion of the population of inspectors that was male and this logic extends to choosing a female inspector. The number used to quantify this notion of ‘chances’ is called a probability. Maree would therefore have established the probability of randomly observing a male or a female inspector in the population on any specific occasion.

Probability is expressed on a 0.0 (the observation or event will certainly not be seen) to 1.0 (the observation or event will certainly be seen) scale where values close to 0.0 indicate observations that are less certain to be seen and values close to 1.0 indicate observations that are more certain to be seen (a value of .5 indicates an even chance that an observation or event will or will not be seen – a state of maximum uncertainty). Statisticians often interpret a probability as the likelihood of observing an event or type of individual in the population.

In the QCI database, we noted that the relative frequency of observing males was .52 and for females was .46. If we take these relative frequencies as estimates of the proportions of each gender in the population of inspectors, then .52 and .46 represent the probability of observing a male or female inspector, respectively.

Statisticians would state this as “the probability of observing a male quality control inspector is .52” or, in a more commonly used shorthand code, the likelihood of observing a male quality control inspector is p = .52 (p for probability). For some, probabilities make more sense if they are converted to percentages (by multiplying by 100%). Thus, p = .52 can also be understood as a 52% chance of observing a male quality control inspector.

We have seen that relative frequency is a sample statistic that can be used to estimate the population probability. Our estimate will get more precise as we use larger and larger samples (technically, as the size of our samples more closely approximates the size of our population). In most behavioural research, we never have access to entire populations so we must always estimate our probabilities.

In some very special populations, having a known number of fixed possible outcomes, such as results of coin tosses or rolls of a die, we can analytically establish event probabilities without doing an infinite number of observations; all we must do is assume that we have a fair coin or die. Thus, with a fair coin, the probability of observing a H or a T on any single coin toss is ½ or .5 or 50%; the probability of observing a 6 on any single throw of a die is 1/6 or .16667 or 16.667%. With behavioural data, though, we can never measure all possible behavioural outcomes, which thereby forces researchers to depend on samples of observations in order to make estimates of population values.

The concept of probability is central to much of what is done in the statistical analysis of behavioural data. Whenever a behavioural scientist wishes to establish whether a particular relationship exists between variables or whether two groups, treated differently, actually show different behaviours, he/she is playing a probability game. Given a sample of observations, the behavioural scientist must decide whether what he/she has observed is providing sufficient information to conclude something about the population from which the sample was drawn.

This decision always has a non-zero probability of being in error simply because, in samples that are much smaller than the population, there is always the chance or probability that we are observing something rare and atypical instead of something which is indicative of a consistent population trend. Thus, the concept of probability forms the cornerstone for statistical inference, about which we will have more to say later (see the relevant Fundamental Concept in Chap. 7). Probability also plays an important role in helping us to understand theoretical statistical distributions (e.g. the normal distribution) and what they can tell us about our observations. We will explore this idea further in Fundamental Concept II.

The Concept of Conditional Probability

It is important to understand that the concept of probability as described above focuses upon the likelihood or chances of observing a specific event or type of observation for a specific variable relative to a population or sample of observations. However, many important behavioural research issues may focus on the question of the probability of observing a specific event given that the researcher has knowledge that some other event has occurred or been observed (this latter event is usually measured by a second variable). Here, the focus is on the potential relationship or link between two variables or two events.

With respect to the QCI database, Maree could ask the quite reasonable question: “what is the probability (estimated in the QCI sample by a relative frequency) of observing an inspector being female, given that she knows that the inspector works for a Large Business Computer manufacturer?”

To address this question, all she needs to know is:

  • how many inspectors from Large Business Computer manufacturers are in the sample ( 22 ); and
  • how many of those inspectors were female ( 7 ) (inspectors who were missing a score for either company or gender have been ignored here).

If she divides 7 by 22, she would obtain the probability that an inspector is female given that they work for a Large Business Computer manufacturer – that is, p = .32 .

This type of question points to the important concept of conditional probability (‘conditional’ because we are asking “what is the probability of observing one event conditional upon our knowledge of some other event”).

Continuing with the previous example, Maree would say that the conditional probability of observing a female inspector working for a Large Business Computer manufacturer is .32 or, equivalently, a 32% chance. Compare this conditional probability of p  = .32 to the overall probability of observing a female inspector in the entire sample ( p  = .46 as shown above).

This means that there is evidence for a connection or relationship between gender and the type of company an inspector works for. That is, the chances are lower for observing a female inspector from a Large Business Computer manufacturer than they are for simply observing a female inspector at all.

Maree therefore has evidence suggesting that females may be relatively under-represented in Large Business Computer manufacturing companies compared to the overall population. Knowing something about the company an inspector works for therefore can help us make a better prediction about their likely gender.

Suppose, however, that Maree’s conditional probability had been exactly equal to p  = .46. This would mean that there was exactly the same chance of observing a female inspector working for a Large Business Computer manufacturer as there was of observing a female inspector in the general population. Here, knowing something about the company an inspector works for doesn’t help Maree make any better prediction about their likely gender. This would mean that the two variables are statistically independent of each other.

A classic case of events that are statistically independent is two successive throws of a fair die: rolling a six on the first throw gives us no information for predicting how likely it will be that we would roll a six on the second throw. The conditional probability of observing a six on the second throw given that I have observed a six on the first throw is .16667 (= 1 divided by 6), which is the same as the simple probability of observing a six on any specific throw. This statistical independence also means that if we wanted to know the probability of throwing two sixes on two successive throws of a fair die, we would just multiply the probabilities for each independent event (i.e. throw) together; that is, .16667 × .16667 = .02778 (this is known as the multiplication rule of probability; see, for example, Smithson 2000, p. 114).

Finally, you should know that conditional probabilities are often asymmetric. This means that for many types of behavioural variables, reversing the conditional arrangement will change the story about the relationship. Bayesian statistics (see the relevant Fundamental Concept in Chap. 7) relies heavily upon this asymmetric relationship between conditional probabilities.

Maree has already learned that the conditional probability that an inspector is female given that they worked for a Large Business Computer manufacturer is p = .32. She could easily turn the conditional relationship around and ask what is the conditional probability that an inspector works for a Large Business Computer manufacturer given that the inspector is female?

From the QCI database, she can find that 51 inspectors in her total sample were female and of those 51, 7 worked for a Large Business Computer manufacturer. If she divided 7 by 51, she would get p = .14 (did you notice that all that changed was the number she divided by?). Thus, there is only a 14% chance of observing an inspector working for a Large Business Computer manufacturer given that the inspector is female – a rather different probability from p = .32, which tells a different story.
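
The asymmetry becomes obvious when both conditional probabilities are computed from the same joint count but with different denominators (a sketch using the QCI counts quoted above):

```python
# Conditional probabilities are asymmetric: same numerator, different denominators.
n_female_and_lbc = 7    # female inspectors working for Large Business Computer manufacturers
n_lbc = 22              # all inspectors working for Large Business Computer manufacturers
n_female = 51           # all female inspectors in the sample

p_female_given_lbc = n_female_and_lbc / n_lbc     # ~ .32
p_lbc_given_female = n_female_and_lbc / n_female  # ~ .14

print(round(p_female_given_lbc, 2), round(p_lbc_given_female, 2))
```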

As you will see in the crosstabulation procedures in Chaps. 6 and 7, conditional relationships between categorical variables are precisely what crosstabulation contingency tables are designed to reveal.

Procedure 5.6: Exploratory Data Analysis

There are a variety of visual display methods for EDA, including stem & leaf displays, boxplots and violin plots. Each method reflects a specific way of displaying features of a distribution of scores or measurements and, of course, each has its own advantages and disadvantages. In addition, EDA displays are surprisingly flexible and can combine features in various ways to enhance the story conveyed by the plot.

Stem & Leaf Displays

The stem & leaf display is a simple data summary technique which not only rank orders the data points in a sample but presents them visually so that the shape of the data distribution is reflected. Stem & leaf displays are formed from data scores by splitting each score into two parts: the first part of each score serving as the ‘stem’, the second part as the ‘leaf’ (e.g. for 2-digit data values, the ‘stem’ is the number in the tens position; the ‘leaf’ is the number in the ones position). Each stem is then listed vertically, in ascending order, followed horizontally by all the leaves in ascending order associated with it. The resulting display thus shows all of the scores in the sample, but reorganised so that a rough idea of the shape of the distribution emerges. As well, extreme scores can be easily identified in a stem & leaf display.
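
The splitting logic is easy to express in code. Below is a minimal, hypothetical Python sketch for 2- and 3-digit scores (not the R Commander implementation, which offers half-stems and other refinements):

```python
# Minimal stem & leaf display: stem = tens digit, leaf = ones digit.
from collections import defaultdict

def stem_and_leaf(scores):
    stems = defaultdict(list)
    for score in sorted(scores):
        stems[score // 10].append(score % 10)   # split each score into stem and leaf
    for stem in sorted(stems):
        leaves = "".join(str(leaf) for leaf in stems[stem])
        print(f"{stem:>3} | {leaves}")

# Hypothetical accuracy-style percentages
stem_and_leaf([57, 62, 68, 71, 73, 75, 77, 78, 80, 82, 83, 85, 86, 88, 91, 94, 100])
```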

Consider the accuracy and speed scores for the 112 quality control inspectors in the QCI sample. Figure 5.22 (produced by the R Commander Stem-and-leaf display … procedure) shows the stem & leaf displays for inspection accuracy (left display) and speed (right display) data.

Fig. 5.22 Stem & leaf displays produced by R Commander

[The first six lines reflect information from R Commander about each display: lines 1 and 2 show the actual R command used to produce the plot (the variable name has been highlighted in bold); line 3 gives a warning indicating that inspectors with missing values (= NA in R) on the variable have been omitted from the display; line 4 shows how the stems and leaves have been defined; line 5 indicates what a leaf unit represents in value; and line 6 indicates the total number (n) of inspectors included in the display.] In Fig. 5.22, for the accuracy display on the left-hand side, the ‘stems’ have been split into ‘half-stems’—one (which is starred) associated with the ‘leaves’ 0 through 4 and the other associated with the ‘leaves’ 5 through 9—a strategy that gives the display better balance and visual appeal.

Notice how the left stem & leaf display conveys a fairly clear (yet sideways) picture of the shape of the distribution of accuracy scores. It has a rather symmetrical bell-shape to it with only a slight suggestion of negative skewness (toward the extreme score at the top). The right stem & leaf display clearly depicts the highly positively skewed nature of the distribution of speed scores. Importantly, we could reconstruct the entire sample of scores for each variable using its display, which means that unlike most other graphical procedures, we didn’t have to sacrifice any information to produce the visual summary.

Some programs, such as SYSTAT, embellish their stem & leaf displays by indicating in which stem or half-stem the ‘median’ (50th percentile), the ‘upper hinge score’ (75th percentile), and ‘lower hinge score’ (25th percentile) occur in the distribution (recall the discussion of interquartile range in Procedure 5.5 ). This is shown in Fig. 5.23 , produced by SYSTAT, where M and H indicate the stem locations for the median and hinge points, respectively. This stem & leaf display labels a single extreme accuracy score as an ‘outside value’ and clearly shows that this actual score was 57.

Fig. 5.23 Stem & leaf display, produced by SYSTAT, of the accuracy QCI variable

Boxplots

Another important EDA technique is the boxplot or, as it is sometimes known, the box-and-whisker plot . This plot provides a symbolic representation that preserves less of the original nature of the data (compared to a stem & leaf display) but typically gives a better picture of the distributional characteristics. The basic boxplot, shown in Fig. 5.24 , utilises information about the median (50th percentile score) and the upper (75th percentile score) and lower (25th percentile score) hinge points in the construction of the ‘box’ portion of the graph (the ‘median’ defines the centre line in the box; the ‘upper’ and ‘lower hinge values’ define the end boundaries of the box—thus the box encompasses the middle 50% of data values).

Fig. 5.24 Boxplots for the accuracy and speed QCI variables

Additionally, the boxplot utilises the IQR (recall Procedure 5.5 ) as a way of defining what are called ‘fences’ which are used to indicate score boundaries beyond which we would consider a score in a distribution to be an ‘outlier’ (or an extreme or unusual value). In SPSS, the inner fence is typically defined as 1.5 times the IQR in each direction and a ‘far’ outlier or extreme case is typically defined as 3 times the IQR in either direction (Field 2018 , p. 193). The ‘whiskers’ in a boxplot extend out to the data values which are closest to the upper and lower inner fences (in most cases, the vast majority of data values will be contained within the fences). Outliers beyond these ‘whiskers’ are then individually listed. ‘Near’ outliers are those lying just beyond the inner fences and ‘far’ outliers lie well beyond the inner fences.
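
Using those conventions (1.5 × IQR for the inner fences, 3 × IQR for the ‘far’ boundary), the fence logic can be sketched as follows with hypothetical data:

```python
# Locating boxplot fences and outliers (hypothetical speed-style data, in seconds).
import numpy as np

scores = np.array([1.05, 2.2, 3.5, 3.9, 4.5, 5.7, 6.0, 10.0, 17.1])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # 'far' outlier boundaries

outliers = scores[(scores < inner_low) | (scores > inner_high)]
far_outliers = scores[(scores < outer_low) | (scores > outer_high)]
print("beyond inner fences:", outliers, "| beyond far boundaries:", far_outliers)
```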

Figure 5.24 shows two simple boxplots (produced using SPSS), one for the accuracy QCI variable and one for the speed QCI variable. The accuracy plot shows a median value of about 83, roughly 50% of the data fall between about 77 and 89 and there is one outlier, inspector 83, in the lower ‘tail’ of the distribution. The accuracy boxplot illustrates data that are relatively symmetrically distributed without substantial skewness. Such data will tend to have their median in the middle of the box, whiskers of roughly equal length extending out from the box and few or no outliers.

The speed plot shows a median value of about 4 s, roughly 50% of the data fall between 2 s and 6 s and there are four outliers, inspectors 7, 62, 65 and 75 (although inspectors 65 and 75 fall at the same place and are rather difficult to read), all falling in the slow speed ‘tail’ of the distribution. Inspectors 65, 75 and 7 are shown as ‘near’ outliers (open circles) whereas inspector 62 is shown as a ‘far’ outlier (asterisk). The speed boxplot illustrates data which are asymmetrically distributed because of skewness in one direction. Such data may have their median offset from the middle of the box and/or whiskers of unequal length extending out from the box and outliers in the direction of the longer whisker. In the speed boxplot, the data are clearly positively skewed (the longer whisker and extreme values are in the slow speed ‘tail’).

Boxplots are very versatile representations in that side-by-side displays for sub-groups of data within a sample can permit easy visual comparisons of groups with respect to central tendency and variability. Boxplots can also be modified to incorporate information about error bands associated with the median producing what is called a ‘notched boxplot’. This helps in the visual detection of meaningful subgroup differences, where boxplot ‘notches’ don’t overlap.

Figure 5.25 (produced using NCSS) compares the distributions of accuracy and speed scores for QCI inspectors from the five types of companies, plotted side-by-side.

Fig. 5.25 Comparisons of the accuracy (regular boxplots) and speed (notched boxplots) QCI variables for different types of companies

Focus first on the left graph in Fig. 5.25 which plots the distribution of accuracy scores broken down by company using regular boxplots. This plot clearly shows the differing degree of skewness in each type of company (indicated by one or more outliers in one ‘tail’, whiskers which are not the same length and/or the median line being offset from the centre of a box), the differing variability of scores within each type of company (indicated by the overall length of each plot—box and whiskers), and the differing central tendency in each type of company (the median lines do not all fall at the same level of accuracy score). From the left graph in Fig. 5.25 , we could conclude that: inspection accuracy scores are most variable in PC and Large Electrical Appliance manufacturing companies and least variable in the Large Business Computer manufacturing companies; Large Business Computer and PC manufacturing companies have the highest median level of inspection accuracy; and inspection accuracy scores tend to be negatively skewed (many inspectors toward higher levels, relatively fewer who are poorer in inspection performance) in the Automotive manufacturing companies. One inspector, working for an Automotive manufacturing company, shows extremely poor inspection accuracy performance.

The right display compares types of companies in terms of their inspection speed scores, using ‘notched’ boxplots. The notches define upper and lower error limits around each median. Aside from the very obvious positive skewness for speed scores (with a number of slow speed outliers) in every type of company (least so for Large Electrical Appliance manufacturing companies), the story conveyed by this comparison is that inspectors from Large Electrical Appliance and Automotive manufacturing companies have substantially faster median decision speeds compared to inspectors from Large Business Computer and PC manufacturing companies (i.e. their ‘notches’ do not overlap, in terms of speed scores, on the display).

Boxplots can also add interpretive value to other graphical display methods through the creation of hybrid displays. Such displays might combine a standard histogram with a boxplot along the X-axis to provide an enhanced picture of the data distribution as illustrated for the mentabil variable in Fig. 5.26 (produced using NCSS). This hybrid plot also employs a data ‘smoothing’ method called a density trace to outline an approximate overall shape for the data distribution. Any one graphical method would tell some of the story, but combined in the hybrid display, the story of a relatively symmetrical set of mentabil scores becomes quite visually compelling.

Fig. 5.26 A hybrid histogram-density-boxplot of the mentabil QCI variable

Violin Plots

Violin plots are a more recent and interesting EDA innovation, implemented in the NCSS software package (Hintze 2012 ). The violin plot gets its name from the rough shape that the plots tend to take on. Violin plots are another type of hybrid plot, this time combining density traces (mirror-imaged right and left so that the plots have a sense of symmetry and visual balance) with boxplot-type information (median, IQR and upper and lower inner ‘fences’, but not outliers). The goal of the violin plot is to provide a quick visual impression of the shape, central tendency and variability of a distribution (the length of the violin conveys a sense of the overall variability whereas the width of the violin conveys a sense of the frequency of scores occurring in a specific region).

Figure 5.27 (produced using NCSS) compares the distributions of speed scores for QCI inspectors across the five types of companies, plotted side-by-side. The violin plot conveys a similar story to the boxplot comparison for speed in the right graph of Fig. 5.25 . However, notice that with the violin plot, unlike with a boxplot, you also get a sense of distributions that have ‘clumps’ of scores in specific areas. Some violin plots, like that for Automobile manufacturing companies in Fig. 5.27 , have a shape suggesting a multi-modal distribution (recall Procedure 5.4 and the discussion of the fact that a distribution may have multiple modes). The violin plot in Fig. 5.27 has also been produced to show where the median (solid line) and mean (dashed line) would fall within each violin. This facilitates two interpretations: (1) a relative comparison of central tendency across the five companies and (2) relative degree of skewness in the distribution for each company (indicated by the separation of the two lines within a violin; skewness is particularly bad for the Large Business Computer manufacturing companies).

Fig. 5.27 Violin plot comparisons of the speed QCI variable for different types of companies

EDA methods (of which we have illustrated only a small subset; we have not reviewed dot density diagrams, for example) provide summary techniques for visually displaying certain characteristics of a set of data. The advantage of the EDA methods over more traditional graphing techniques such as those described in Procedure 5.2 is that as much of the original integrity of the data is maintained as possible while maximising the amount of summary information available about distributional characteristics.

Stem & leaf displays maintain the data in as close to their original form as possible whereas boxplots and violin plots provide more symbolic and flexible representations. EDA methods are best thought of as communication devices designed to facilitate quick visual impressions and they can add interest to any statistical story being conveyed about a sample of data. NCSS, SYSTAT, STATGRAPHICS and R Commander generally offer more options and flexibility in the generation of EDA displays than SPSS.

EDA methods tend to get cumbersome if a great many variables or groups need to be summarised. In such cases, using numerical summary statistics (such as means and standard deviations) will provide a more economical and efficient summary. Boxplots or violin plots are generally more space efficient summary techniques than stem & leaf displays.

Often, EDA techniques are used as data screening devices, which are typically not reported in actual write-ups of research (we will discuss data screening in more detail in Chap. 8). This is a perfectly legitimate use for the methods, although there is an argument for researchers to put these techniques to greater use in published literature.

Software packages may use different rules for constructing EDA plots, which means that you might get rather different-looking plots and different information from different programs (you saw some evidence of this in Figs. 5.22 and 5.23). It is important to understand what the programs are using as decision rules for locating fences and outliers so that you are clear on how best to interpret the resulting plot; such information is generally contained in the user's guides or manuals for NCSS (Hintze 2012), SYSTAT (SYSTAT Inc. 2009a, b), STATGRAPHICS (StatPoint Technologies Inc. 2010) and SPSS (Norušis 2012).

Virtually any research design which produces numerical measures (even to the extent of just counting the number of occurrences of several events) provides opportunities for employing EDA displays which may help to clarify data characteristics or relationships. One extremely important use of EDA methods is as data screening devices for detecting outliers and other data anomalies, such as non-normality and skewness, before proceeding to parametric statistical analyses. In some cases, EDA methods can help the researcher to decide whether parametric or nonparametric statistical tests would be best to apply to his or her data because critical data characteristics such as distributional shape and spread are directly reflected.

Application Procedures
SPSS

produces stem-and-leaf displays and boxplots by default through the Explore… procedure; variables may be explored on a whole-of-sample basis or broken down by the categories of a specific variable (called a 'factor' in the procedure). Cases can also be labelled with a variable (as in the QCI database), so that outlier points in the boxplot are identifiable.

SPSS can also be used to custom-build different types of boxplots.

NCSS

produces a stem-and-leaf display by default.

can be used to produce box plots with different features (such as ‘notches’ and connecting lines).

can be configured to produce violin plots (by selecting the plot shape as ‘density with reflection’).

SYSTAT

can be used to produce stem-and-leaf displays for variables; however, you cannot really control any features of these displays.

can be used to produce boxplots of many types, with a number of features being controllable.

STATGRAPHICS

allows you to do a complete exploration of a single variable, including stem-and-leaf display (you need to select this option) and boxplot (produced by default). Some features of the boxplot can be controlled, but not features of the stem-and-leaf diagram.

Other menu options can produce not only descriptive statistics but also boxplots with some controllable features.

R Commander

The dialog box for each procedure offers some control over features of the display or plot; whole-of-sample boxplots or boxplots by groups are possible.

Procedure 5.7: Standard ( z ) Scores

In certain practical situations in behavioural research, it may be desirable to know where a specific individual’s score lies relative to all other scores in a distribution. A convenient measure is to observe how many standard deviations (see Procedure 5.5 ) above or below the sample mean a specific score lies. This measure is called a standard score or z -score . Very simply, any raw score can be converted to a z -score by subtracting the sample mean from the raw score and dividing that result by the sample’s standard deviation. z -scores can be positive or negative and their sign simply indicates whether the score lies above (+) or below (−) the mean in value. A z -score has a very simple interpretation: it measures the number of standard deviations above or below the sample mean a specific raw score lies.

In the QCI database, we have a sample mean for speed scores of 4.48 s and a standard deviation for speed scores of 2.89 s (recall Table 5.4 in Procedure 5.5). If we are interested in the z-score for Inspector 65's raw speed score of 11.94 s, we would obtain a z-score of +2.58 using the method described above (subtract 4.48 from 11.94 and divide the result by 2.89). The interpretation of this number is that a raw decision speed score of 11.94 s lies about 2.6 standard deviations above the mean decision speed for the sample.

z -scores have some interesting properties. First, if one converts (statisticians would say ‘transforms’) every available raw score in a sample to z -scores, the mean of these z -scores will always be zero and the standard deviation of these z -scores will always be 1.0. These two facts about z -scores (mean = 0; standard deviation = 1) will be true no matter what sample you are dealing with and no matter what the original units of measurement are (e.g. seconds, percentages, number of widgets assembled, amount of preference for a product, attitude rating, amount of money spent). This is because transforming raw scores to z -scores automatically changes the measurement units from whatever they originally were to a new system of measurements expressed in standard deviation units.
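To make the transformation concrete, here is a minimal Python sketch using the QCI speed mean (4.48 s), standard deviation (2.89 s) and Inspector 65's raw speed (11.94 s) quoted above; the helper name z_score is ours, not something from the text:

```python
# A minimal sketch of the z-score transformation, assuming the QCI
# speed mean (4.48 s) and SD (2.89 s) reported above.
def z_score(raw, mean, sd):
    """Number of SDs a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

# Inspector 65's decision speed of 11.94 s:
print(round(z_score(11.94, 4.48, 2.89), 2))  # 2.58

# Transforming every score in a sample always yields mean 0 and SD 1:
scores = [1.52, 3.32, 3.83, 7.07, 11.94]  # any raw scores will do
m = sum(scores) / len(scores)
sd = (sum((x - m) ** 2 for x in scores) / len(scores)) ** 0.5
zs = [z_score(x, m, sd) for x in scores]
print(round(sum(zs) / len(zs), 10))                          # 0.0
print(round((sum(z * z for z in zs) / len(zs)) ** 0.5, 10))  # 1.0
```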

Suppose Maree was interested in the performance statistics for the top 25% most accurate quality control inspectors in the sample. Given a sample size of 112, this would mean finding the top 28 inspectors in terms of their accuracy scores. Since Maree is interested in performance statistics, speed scores would also be of interest. Table 5.5 (generated using the SPSS Descriptives … procedure, listed using the Case Summaries … procedure and formatted for presentation using Excel) shows accuracy and speed scores for the top 28 inspectors in descending order of accuracy scores. The z -score transformation for each of these scores is also shown (last two columns) as are the type of company, education level and gender for each inspector.

Table 5.5: Listing of the 28 (top 25%) most accurate QCI inspectors' accuracy and speed scores as well as standard (z) score transformations for each score

| Case | Inspector | Company | Educ. level | Gender | Accuracy | Speed | Z accuracy | Z speed |
|---|---|---|---|---|---|---|---|---|
| 1 | 8 | PC Manufacturer | High School Only | Male | 100 | 1.52 | 1.95 | −1.03 |
| 2 | 9 | PC Manufacturer | High School Only | Female | 100 | 3.32 | 1.95 | −0.40 |
| 3 | 14 | PC Manufacturer | High School Only | Male | 100 | 3.83 | 1.95 | −0.23 |
| 4 | 17 | PC Manufacturer | High School Only | Female | 99 | 7.07 | 1.84 | 0.90 |
| 5 | 101 | PC Manufacturer | High School Only | — | 98 | 3.11 | 1.73 | −0.47 |
| 6 | 19 | PC Manufacturer | Tertiary Qualified | Female | 94 | 3.84 | 1.29 | −0.22 |
| 7 | 34 | Large Electrical Appliance Manufacturer | Tertiary Qualified | Male | 94 | 1.90 | 1.29 | −0.89 |
| 8 | 65 | Large Business Computer Manufacturer | High School Only | Male | 94 | 11.94 | 1.29 | 2.58 |
| 9 | 67 | Large Business Computer Manufacturer | High School Only | Male | 94 | 2.34 | 1.29 | −0.74 |
| 10 | 80 | Large Business Computer Manufacturer | High School Only | Female | 94 | 4.68 | 1.29 | 0.07 |
| 11 | 5 | PC Manufacturer | Tertiary Qualified | Male | 93 | 4.18 | 1.18 | −0.10 |
| 12 | 18 | PC Manufacturer | Tertiary Qualified | Male | 93 | 7.32 | 1.18 | 0.98 |
| 13 | 46 | Small Electrical Appliance Manufacturer | Tertiary Qualified | Female | 93 | 2.01 | 1.18 | −0.86 |
| 14 | 64 | Large Business Computer Manufacturer | High School Only | Female | 92 | 5.18 | 1.08 | 0.24 |
| 15 | 77 | Large Business Computer Manufacturer | Tertiary Qualified | Female | 92 | 6.11 | 1.08 | 0.56 |
| 16 | 79 | Large Business Computer Manufacturer | High School Only | Male | 92 | 4.38 | 1.08 | −0.03 |
| 17 | 106 | Large Electrical Appliance Manufacturer | Tertiary Qualified | Male | 92 | 1.70 | 1.08 | −0.96 |
| 18 | 58 | Small Electrical Appliance Manufacturer | High School Only | Male | 91 | 4.12 | 0.97 | −0.12 |
| 19 | 63 | Large Business Computer Manufacturer | High School Only | Male | 91 | 4.73 | 0.97 | 0.09 |
| 20 | 72 | Large Business Computer Manufacturer | Tertiary Qualified | Male | 91 | 4.72 | 0.97 | 0.08 |
| 21 | 20 | PC Manufacturer | High School Only | Male | 90 | 4.53 | 0.86 | 0.02 |
| 22 | 69 | Large Business Computer Manufacturer | High School Only | Male | 90 | 4.94 | 0.86 | 0.16 |
| 23 | 71 | Large Business Computer Manufacturer | High School Only | Female | 90 | 10.46 | 0.86 | 2.07 |
| 24 | 85 | Automobile Manufacturer | Tertiary Qualified | Female | 90 | 3.14 | 0.86 | −0.46 |
| 25 | 111 | Large Business Computer Manufacturer | High School Only | Male | 90 | 4.11 | 0.86 | −0.13 |
| 26 | 6 | PC Manufacturer | High School Only | Male | 89 | 5.46 | 0.75 | 0.34 |
| 27 | 61 | Large Business Computer Manufacturer | Tertiary Qualified | Male | 89 | 5.71 | 0.75 | 0.43 |
| 28 | 75 | Large Business Computer Manufacturer | High School Only | Male | 89 | 12.05 | 0.75 | 2.62 |

There are three inspectors (8, 9 and 14) who scored maximum accuracy of 100%. Such accuracy converts to a z -score of +1.95. Thus 100% accuracy is 1.95 standard deviations above the sample’s mean accuracy level. Interestingly, all three inspectors worked for PC manufacturers and all three had only high school-level education. The least accurate inspector in the top 25% had a z -score for accuracy that was .75 standard deviations above the sample mean.

Interestingly, the top three inspectors in terms of accuracy had decision speeds that fell below the sample’s mean speed; inspector 8 was the fastest inspector of the three with a speed just over 1 standard deviation ( z  = −1.03) below the sample mean. The slowest inspector in the top 25% was inspector 75 (case #28 in the list) with a speed z -score of +2.62; i.e., he was over two and a half standard deviations slower in making inspection decisions relative to the sample’s mean speed.

The fact that z-scores always have a common measurement scale, with a mean of 0 and a standard deviation of 1.0, leads to an interesting application of standard scores. Suppose we focus on inspector number 65 (case #8 in the list) in Table 5.5. It might be of interest to compare this inspector's quality control performance in terms of both his decision accuracy and decision speed. Such a comparison is impossible using raw scores since the inspector's accuracy and speed scores are different measures which have differing means and standard deviations expressed in fundamentally different units of measurement (percentages and seconds). However, if we are willing to assume that the score distributions for both variables are approximately the same shape and that both accuracy and speed are measured with about the same level of reliability or consistency (see Chap. 8), we can compare the inspector's two scores by first converting them to z-scores within their own respective distributions, as shown in Table 5.5.

Inspector 65 looks rather anomalous in that he demonstrated a relatively high level of accuracy (raw score = 94%; z  = +1.29) but took a very long time to make those accurate decisions (raw score = 11.94 s; z  = +2.58). Contrast this with inspector 106 (case #17 in the list) who demonstrated a similar level of accuracy (raw score = 92%; z  = +1.08) but took a much shorter time to make those accurate decisions (raw score = 1.70 s; z  = −.96). In terms of evaluating performance, from a company perspective, we might conclude that inspector 106 is performing at an overall higher level than inspector 65 because he can achieve a very high level of accuracy but much more quickly; accurate and fast is more cost effective and efficient than accurate and slow.

Note: We should be cautious here. We know from our previous explorations in Procedure 5.6 that accuracy scores look fairly symmetrical while speed scores are positively skewed, so assuming that the two variables have the same distributional shape, as required for z-score comparisons, would be problematic.

You might have noticed that as you scanned down the two columns of z-scores in Table 5.5, there was a suggestion of a pattern between the signs attached to the respective z-scores for each person. There seems to be a very slight preponderance of pairs of z-scores where the signs are reversed (12 out of 22 pairs). This observation provides some very preliminary evidence to suggest that there may be a relationship between inspection accuracy and decision speed, namely that a more accurate decision tends to be associated with a faster decision speed. Of course, this pattern would be better verified using the entire sample rather than the top 25% of inspectors. However, you may find it interesting to learn that it is precisely this sort of suggestive evidence (about agreement or disagreement between z-score signs for pairs of variable scores throughout a sample) that is captured and summarised by a single statistical indicator called a 'correlation coefficient' (see the relevant Fundamental Concept and Procedure in Chap. 6).

z-scores are not the only type of standard score that is commonly used. Three other types of standard scores are: stanines (standard nines), IQ scores and T-scores (not to be confused with the t-test described in Chap. 7). These other types of scores have the advantage of producing only positive integer scores rather than positive and negative decimal scores. This makes interpretation somewhat easier for certain applications. However, you should know that almost all other types of standard scores come from a specific transformation of z-scores. This is because once you have converted raw scores into z-scores, they can then be quite readily transformed into any other system of measurement by simply multiplying a person's z-score by the new desired standard deviation for the measure and adding to that product the new desired mean for the measure.

T-scores are simply z-scores transformed to have a mean of 50.0 and a standard deviation of 10.0; IQ scores are simply z-scores transformed to have a mean of 100 and a standard deviation of 15 (or 16 in some systems). For more information, see Fundamental Concept II .
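A small sketch of this rescaling logic, using the T-score (mean 50, SD 10) and IQ (mean 100, SD 15) conventions just described; the function name is ours, and the example z-score of +1.29 is Inspector 65's accuracy z-score from Table 5.5:

```python
# Converting z-scores into other standard-score systems:
# new score = new_mean + new_sd * z.
def rescale(z, new_mean, new_sd):
    return new_mean + new_sd * z

z = 1.29                    # Inspector 65's accuracy z-score (Table 5.5)
print(rescale(z, 50, 10))   # T-score: 62.9
print(rescale(z, 100, 15))  # IQ scale (SD 15): 119.35
```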

Standard scores are useful for representing the position of each raw score within a sample distribution relative to the mean of that distribution. The unit of measurement becomes the number of standard deviations a specific score is away from the sample mean. As such, z-scores can permit cautious comparisons across samples or across different variables having vastly differing means and standard deviations, within the constraints of the comparison samples having similarly shaped distributions and roughly equivalent levels of measurement reliability. z-scores also form the basis for establishing the degree of correlation between two variables. Transforming raw scores into z-scores does not change the shape of a distribution or the rank ordering of individuals within that distribution. For this reason, a z-score is referred to as a linear transformation of a raw score. Interestingly, z-scores provide an important foundational element for more complex analytical procedures such as factor analysis, cluster analysis and multiple regression analysis (see the relevant procedures in Chaps. 6 and 7).

While standard scores are useful indices, they are subject to restrictions if used to compare scores across samples or across different variables. The samples must have similar distribution shapes for the comparisons to be meaningful and the measures must have similar levels of reliability in each sample. The groups used to generate the z -scores should also be similar in composition (with respect to age, gender distribution, and so on). Because z -scores are not an intuitively meaningful way of presenting scores to lay-persons, many other types of standard score schemes have been devised to improve interpretability. However, most of these schemes produce scores that run a greater risk of facilitating lay-person misinterpretations simply because their connection with z -scores is hidden or because the resulting numbers ‘look’ like a more familiar type of score which people do intuitively understand.

It is extremely rare for a T-score to exceed 100 or go below 0 because this would mean that the raw score was in excess of 5 standard deviations away from the sample mean. This unfortunately means that T-scores are often misinterpreted as percentages because they typically range between 0 and 100 and therefore ‘look’ like percentages. However, T-scores are definitely not percentages.

Finally, a common misunderstanding of z -scores is that transforming raw scores into z -scores makes them follow a normal distribution (see Fundamental Concept II ). This is not the case. The distribution of z -scores will have exactly the same shape as that for the raw scores; if the raw scores are positively skewed, then the corresponding z -scores will also be positively skewed.

z-scores are particularly useful in evaluative studies where relative performance indices are of interest. Whenever you compute a correlation coefficient (see Chap. 6), you are implicitly transforming the two variables involved into z-scores (which equates the variables in terms of mean and standard deviation), so that only the patterning in the relationship between the variables is represented. z-scores are also useful as a preliminary step to more advanced parametric statistical methods when variables differing in scale, range and/or measurement units must be equated for means and standard deviations prior to analysis.

Application Procedures

SPSS: run the Descriptives… procedure and tick the box labelled 'Save standardized values as variables'. z-scores are saved as new variables (labelled as Z followed by the original variable name, as shown in Table 5.5) which can then be listed or analysed further.

NCSS: select a new variable to hold the z-scores, then select the 'STANDARDIZE' transformation from the list of available functions. z-scores are saved as new variables which can then be listed or analysed further.

SYSTAT: z-scores are saved as new variables which can then be listed or analysed further.

STATGRAPHICS: open the data window, select an empty column in the database, choose the 'STANDARDIZE' transformation, choose the variable you want to transform and give the new variable a name.

R Commander: select the variables you want to standardize; R Commander automatically saves the transformed variables to the database, appending Z. to the front of each variable's name.

Fundamental Concept II: The Normal Distribution

Arguably the most fundamental distribution used in the statistical analysis of quantitative data in the behavioural and social sciences is the normal distribution (also known as the Gaussian or bell-shaped distribution ). Many behavioural phenomena, if measured on a large enough sample of people, tend to produce ‘normally distributed’ variable scores. This includes most measures of ability, performance and productivity, personality characteristics and attitudes. The normal distribution is important because it is the one form of distribution that you must assume describes the scores of a variable in the population when parametric tests of statistical inference are undertaken. The standard normal distribution is defined as having a population mean of 0.0 and a population standard deviation of 1.0. The normal distribution is also important as a means of interpreting various types of scoring systems.

Figure 5.28 displays the standard normal distribution (mean = 0; standard deviation = 1.0) and shows that there is a clear link between z -scores and the normal distribution. Statisticians have analytically calculated the probability (also expressed as percentages or percentiles) that observations will fall above or below any specific z -score in the theoretical standard normal distribution. Thus, a z -score of +1.0 in the standard normal distribution will have 84.13% (equals a probability of .8413) of observations in the population falling at or below one standard deviation above the mean and 15.87% falling above that point. A z -score of −2.0 will have 2.28% of observations falling at that point or below and 97.72% of observations falling above that point. It is clear then that, in a standard normal distribution, z -scores have a direct relationship with percentiles .
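These tabled probabilities can be reproduced with any statistics library. For example, a sketch using scipy's implementation of the standard normal distribution:

```python
# z-scores and percentiles in the standard normal distribution.
from scipy.stats import norm

print(norm.cdf(1.0))    # 0.8413 -> 84.13% fall at or below z = +1.0
print(norm.cdf(-2.0))   # 0.0228 -> 2.28% fall at or below z = -2.0
print(norm.ppf(0.975))  # 1.96   -> z-score cutting off the top 2.5%
```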

Figure 5.28: The normal (bell-shaped or Gaussian) distribution

Figure 5.28 also shows how T-scores relate to the standard normal distribution and to z -scores. The mean T-score falls at 50 and each increment or decrement of 10 T-score units means a movement of another standard deviation away from this mean of 50. Thus, a T-score of 80 corresponds to a z -score of +3.0—a score 3 standard deviations higher than the mean of 50.

Of special interest to behavioural researchers are the values for z -scores in a standard normal distribution that encompass 90% of observations ( z  = ±1.645—isolating 5% of the distribution in each tail), 95% of observations ( z  = ±1.96—isolating 2.5% of the distribution in each tail), and 99% of observations ( z  = ±2.58—isolating 0.5% of the distribution in each tail).

Depending upon the degree of certainty required by the researcher, these bands describe regions outside of which one might define an observation as being atypical or as perhaps not belonging to a distribution being centred at a mean of 0.0. Most often, what is taken as atypical or rare in the standard normal distribution is a score at least two standard deviations away from the mean, in either direction. Why choose two standard deviations? Since in the standard normal distribution, only about 5% of observations will fall outside a band defined by z -scores of ±1.96 (rounded to 2 for simplicity), this equates to data values that are 2 standard deviations away from their mean. This can give us a defensible way to identify outliers or extreme values in a distribution.
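As a rough sketch, the two-standard-deviation rule for flagging potential outliers might look like this in Python (the data are invented for illustration):

```python
# Flagging values more than 2 SDs from the mean (illustrative data).
data = [4.5, 3.2, 5.1, 4.8, 12.0, 4.1, 3.9, 4.4]
m = sum(data) / len(data)
sd = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5

outliers = [x for x in data if abs((x - m) / sd) > 2]
print(outliers)  # [12.0]
```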

Thinking ahead to what you will encounter in Chap. 7, this 'banding' logic can be extended into the world of statistics (like means and percentages) as opposed to just the world of observations. You will frequently hear researchers speak of some statistic estimating a specific value (a parameter) in a population, plus or minus some other value.

A survey organisation might report political polling results in terms of a percentage and an error band, e.g. 59% of Australians indicated that they would vote Labour at the next federal election, plus or minus 2%.

Most commonly, this error band (±2%) is defined by possible values for the population parameter that are about two standard deviations (or two standard errors, a concept discussed further in Chap. 7) away from the reported or estimated statistical value. In effect, the researcher is saying that on 95% of the occasions he/she would theoretically conduct his/her study, the population value estimated by the statistic being reported would fall between the limits imposed by the endpoints of the error band (the official name for this error band is a confidence interval; see Chap. 8). The well-understood mathematical properties of the standard normal distribution are what make such precise statements about levels of error in statistical estimates possible.
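The arithmetic behind such an error band is straightforward. A small sketch for the polling example, where the sample size of 2,400 is our own assumption (the text gives none), chosen so the margin comes out near 2%:

```python
# A sketch of the 'plus or minus two standard errors' logic for a
# proportion; n = 2400 is an assumed sample size, not from the text.
p, n = 0.59, 2400
se = (p * (1 - p) / n) ** 0.5       # standard error of a proportion
low, high = p - 2 * se, p + 2 * se  # approximate 95% confidence interval
print(f"{low:.3f} to {high:.3f}")   # roughly 0.570 to 0.610
```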

Checking for Normality

It is important to understand that transforming the raw scores for a variable to z-scores (recall Procedure 5.7) does not produce z-scores which follow a normal distribution; rather, they will have the same distributional shape as the original scores. However, if you are willing to assume that the normal distribution is the correct reference distribution in the population, then you are justified in interpreting z-scores in light of the known characteristics of the normal distribution.

In order to justify this assumption, not only to enhance the interpretability of z -scores but more generally to enhance the integrity of parametric statistical analyses, it is helpful to actually look at the sample frequency distributions for variables (using a histogram (illustrated in Procedure 5.2 ) or a boxplot (illustrated in Procedure 5.6 ), for example), since non-normality can often be visually detected. It is important to note that in the social and behavioural sciences as well as in economics and finance, certain variables tend to be non-normal by their very nature. This includes variables that measure time taken to complete a task, achieve a goal or make decisions and variables that measure, for example, income, occurrence of rare or extreme events or organisational size. Such variables tend to be positively skewed in the population, a pattern that can often be confirmed by graphing the distribution.

If you cannot justify an assumption of ‘normality’, you may be able to force the data to be normally distributed by using what is called a ‘normalising transformation’. Such transformations will usually involve a nonlinear mathematical conversion (such as computing the logarithm, square root or reciprocal) of the raw scores. Such transformations will force the data to take on a more normal appearance so that the assumption of ‘normality’ can be reasonably justified, but at the cost of creating a new variable whose units of measurement and interpretation are more complicated. [For some non-normal variables, such as the occurrence of rare, extreme or catastrophic events (e.g. a 100-year flood or forest fire, coronavirus pandemic, the Global Financial Crisis or other type of financial crisis, man-made or natural disaster), the distributions cannot be ‘normalised’. In such cases, the researcher needs to model the distribution as it stands. For such events, extreme value theory (e.g. see Diebold et al. 2000 ) has proven very useful in recent years. This theory uses a variation of the Pareto or Weibull distribution as a reference, rather than the normal distribution, when making predictions.]
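A minimal sketch of these normalising transformations, applied to an invented positively skewed variable rather than the actual QCI speed scores:

```python
# Common 'normalising' transformations for positive skew; the data
# are illustrative only.
import numpy as np

speed = np.array([1.5, 2.0, 2.3, 3.1, 4.4, 5.0, 7.1, 11.9, 12.1])

log_speed = np.log10(speed)  # compresses the long right tail
sqrt_speed = np.sqrt(speed)  # milder compression
recip_speed = 1.0 / speed    # strong compression; reverses rank order

print(log_speed.round(2))    # now interpreted in log10(seconds)
```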

Figure 5.29 displays before and after pictures of the effects of a logarithmic transformation on the positively skewed speed variable from the QCI database. Each graph, produced using NCSS, is of the hybrid histogram-density trace-boxplot type first illustrated in Procedure 5.6 . The left graph clearly shows the strong positive skew in the speed scores and the right graph shows the result of taking the log 10 of each raw score.

Figure 5.29: Combined histogram-density trace-boxplot graphs displaying the before and after effects of a 'normalising' log10 transformation of the speed variable

Notice how the long tail toward slow speed scores is pulled in toward the mean and the very short tail toward fast speed scores is extended away from the mean. The result is a more ‘normal’ appearing distribution. The assumption would then be that we could assume normality of speed scores, but only in a log 10 format (i.e. it is the log of speed scores that we assume is normally distributed in the population). In general, taking the logarithm of raw scores provides a satisfactory remedy for positively skewed distributions (but not for negatively skewed ones). Furthermore, anything we do with the transformed speed scores now has to be interpreted in units of log 10 (seconds) which is a more complex interpretation to make.

Another visual method for detecting non-normality is to graph what is called a normal Q-Q plot (the Q-Q stands for Quantile-Quantile). This plots the percentiles for the observed data against the percentiles for the standard normal distribution (see Cleveland 1995 for more detailed discussion; also see Lane 2007, http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html). If the pattern for the observed data follows a normal distribution, then all the points on the graph will fall approximately along a diagonal line.

Figure 5.30 shows the normal Q-Q plots for the original speed variable and the transformed log-speed variable, produced using the SPSS Explore... procedure. The diagnostic diagonal line is shown on each graph. In the left-hand plot, for speed , the plot points clearly deviate from the diagonal in a way that signals positive skewness. The right-hand plot, for log_speed, shows the plot points generally falling along the diagonal line thereby conforming much more closely to what is expected in a normal distribution.

Figure 5.30: Normal Q-Q plots for the original speed variable and the new log_speed variable
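A comparable pair of Q-Q plots can be sketched with scipy's probplot function; the skewed sample below is simulated, not the actual QCI data:

```python
# Normal Q-Q plots for a skewed sample and its log10 transform
# (simulated data, not the QCI scores).
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
speed = rng.lognormal(mean=1.3, sigma=0.6, size=112)  # positively skewed
log_speed = np.log10(speed)

fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(speed, dist="norm", plot=ax1)      # points bow off the line
stats.probplot(log_speed, dist="norm", plot=ax2)  # points hug the line
plt.show()
```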

In addition to visual ways of detecting non-normality, there are also numerical ways. As highlighted in Chap. 1, there are two additional characteristics of any distribution, namely skewness (asymmetric distribution tails) and kurtosis (peakedness of the distribution). Both have an associated statistic that provides a measure of that characteristic, similar to the mean and standard deviation statistics. In a normal distribution, the values for the skewness and kurtosis statistics are both zero (skewness = 0 means a symmetric distribution; kurtosis = 0 means a mesokurtic distribution). The further away each statistic is from zero, the more the distribution deviates from a normal shape. Both the skewness statistic and the kurtosis statistic have standard errors associated with them (see the relevant Fundamental Concept in Chap. 7; these work very much like the standard deviation, only for a statistic rather than for observations), and they can be routinely computed by almost any statistical package when you request a descriptive analysis. Without going into the logic right now (this will come in Chap. 7), a rough rule of thumb you can use to check for normality using the skewness and kurtosis statistics is the following:

  • Prepare: Take the standard error for the statistic and multiply it by 2 (or 3 if you want to be more conservative).
  • Interval: Add the result from the Prepare step to the value of the statistic and subtract the result from the value of the statistic. You will end up with two numbers, one low and one high, that define the ends of an interval (what you have just created approximates what is called a 'confidence interval'; see Chap. 8).
  • Check: If zero falls inside this interval (i.e. between the low and high endpoints from the Interval step), then there is likely to be no significant issue with that characteristic of the distribution. If zero falls outside the interval (i.e. lower than the low endpoint or higher than the high endpoint), then you likely have an issue with non-normality with respect to that characteristic. (A rough code sketch of this rule follows below.)
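Here is a rough sketch of the three-step rule in Python. SPSS computes exact standard errors; this sketch substitutes the common large-sample approximations SE(skewness) ≈ √(6/n) and SE(kurtosis) ≈ √(24/n), so its numbers will differ slightly from those in Table 5.6:

```python
# The Prepare-Interval-Check rule, assuming the large-sample SE
# approximations noted above (not SPSS's exact formulas).
def check_normality(stat, se, multiplier=2):
    low, high = stat - multiplier * se, stat + multiplier * se  # Interval
    return low, high, low <= 0 <= high                          # Check

n = 112                                  # QCI sample size
se_skew = (6 / n) ** 0.5                 # approx. 0.231 (SPSS reports 0.229)
print(check_normality(1.487, se_skew))   # zero outside -> skewness problem
print(check_normality(-0.050, se_skew))  # zero inside  -> no problem
```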

Visually, we saw in the left graph in Fig. 5.29 that the speed variable was highly positively skewed. What if Maree wanted to check some numbers to support this judgment? She could ask SPSS to produce the skewness and kurtosis statistics for both the original speed variable and the new log_speed variable using the Frequencies... or the Explore... procedure. Table 5.6 shows what SPSS would produce if the Frequencies ... procedure were used.

Table 5.6: Skewness and kurtosis statistics and their standard errors for both the original speed variable and the new log_speed variable

Using the 3-step check rule described above, Maree could roughly evaluate the normality of the two variables as follows:

  • skewness (speed): [Prepare] 2 × .229 = .458 ➔ [Interval] 1.487 − .458 = 1.029 and 1.487 + .458 = 1.945 ➔ [Check] zero does not fall inside the interval bounded by 1.029 and 1.945, so there appears to be a significant problem with skewness. Since the value of the skewness statistic (1.487) is positive, the problem is positive skewness, confirming what the left graph in Fig. 5.29 showed.
  • kurtosis (speed): [Prepare] 2 × .455 = .91 ➔ [Interval] 3.071 − .91 = 2.161 and 3.071 + .91 = 3.981 ➔ [Check] zero does not fall inside the interval bounded by 2.161 and 3.981, so there appears to be a significant problem with kurtosis. Since the value of the kurtosis statistic (3.071) is positive, the problem is leptokurtosis: the peak of the distribution is too tall relative to what is expected in a normal distribution.
  • skewness (log_speed): [Prepare] 2 × .229 = .458 ➔ [Interval] −.050 − .458 = −.508 and −.050 + .458 = .408 ➔ [Check] zero falls within the interval bounded by −.508 and .408, so there appears to be no problem with skewness. The log transform appears to have corrected the problem, confirming what the right graph in Fig. 5.29 showed.
  • kurtosis (log_speed): [Prepare] 2 × .455 = .91 ➔ [Interval] −.672 − .91 = −1.582 and −.672 + .91 = .238 ➔ [Check] zero falls within the interval bounded by −1.582 and .238, so there appears to be no problem with kurtosis. The log transform appears to have corrected this problem as well, rendering the distribution more approximately mesokurtic (i.e. normal) in shape.

There are also more formal tests of significance (see Chap. 7) that one can use to numerically evaluate normality, such as the Kolmogorov–Smirnov test and the Shapiro–Wilk test. Each of these tests, for example, can be produced by SPSS on request, via the Explore... procedure.
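Comparable tests are available in Python via scipy; a sketch with simulated data:

```python
# Shapiro-Wilk and Kolmogorov-Smirnov normality tests on a simulated,
# positively skewed sample (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
speed = rng.lognormal(mean=1.3, sigma=0.6, size=112)

print(stats.shapiro(speed))                       # small p -> non-normal
print(stats.kstest(stats.zscore(speed), "norm"))  # KS test against N(0, 1)
```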

1 For more information, see Chap. 1 – The language of statistics.


Useful Additional Readings for Procedure 5.2

  • Field A. Discovering statistics using SPSS for Windows. 5th ed. Los Angeles: Sage; 2018.
  • George D, Mallery P. IBM SPSS statistics 25 step by step: A simple guide and reference. 15th ed. Boston, MA: Pearson Education; 2019.
  • Hintze JL. NCSS 8 help system: Graphics. Kaysville, UT: Number Cruncher Statistical Systems; 2012.
  • StatPoint Technologies, Inc. STATGRAPHICS Centurion XVI user manual. Warrenton, VA: StatPoint Technologies Inc.; 2010.
  • SYSTAT Software Inc. SYSTAT 13: Graphics. Chicago, IL: SYSTAT Software Inc; 2009.

References for Procedure 5.3

  • Cleveland WR. Visualizing data. Summit, NJ: Hobart Press; 1995.
  • Jacoby WJ. Statistical graphics for visualizing multivariate data. Thousand Oaks, CA: Sage; 1998.


References for Procedure 5.6

  • Norušis MJ. IBM SPSS statistics 19 guide to data analysis. Upper Saddle River, NJ: Prentice Hall; 2012.
  • Field A. Discovering statistics using SPSS for Windows. 5th ed. Los Angeles: Sage; 2018.
  • Hintze JL. NCSS 8 help system: Introduction. Kaysville, UT: Number Cruncher Statistical System; 2012.
  • SYSTAT Software Inc. SYSTAT 13: Statistics - I. Chicago, IL: SYSTAT Software Inc; 2009.

References for Fundamental Concept II

  • Diebold FX, Schuermann T, Stroughair D. Pitfalls and opportunities in the use of extreme value theory in risk management. The Journal of Risk Finance. 2000;1(2):30–35. doi:10.1108/eb043443.
  • Lane D. Online statistics education: A multimedia course of study. Houston, TX: Rice University; 2007.


Descriptive Statistics

The analysis, summary, and presentation of findings related to a data set derived from a sample or entire population

What is Descriptive Statistics?

The term “descriptive statistics” refers to the analysis, summary, and presentation of findings related to a data set derived from a sample or entire population. Descriptive statistics comprises three main categories – Frequency Distribution, Measures of Central Tendency , and Measures of Variability.


Although descriptive statistics may provide information regarding a data set, they do not allow for conclusions to be made based on the data analysis but rather provide a description of the data being analyzed.

  • The term “descriptive statistics” refers to the analysis, summary, and presentation of findings related to a data set derived from a sample or entire population.
  • Descriptive statistics comprises three main categories – Frequency Distribution, Measures of Central Tendency, and Measures of Variability.
  • Descriptive statistics helps facilitate data visualization. It allows for data to be presented in a meaningful and understandable way, which, in turn, allows for a simplified interpretation of the data set in question.

Understanding the Different Types of Descriptive Statistics

Frequency Distribution

Used for both quantitative and qualitative data, frequency distribution depicts the frequency or count of the different outcomes in a data set or sample. The frequency distribution is normally presented in a table or a graph. Each entry in the table or graph is accompanied by the count or frequency of the values’ occurrences in an interval, range, or specific group.

Frequency distribution is basically a presentation or summary of grouped data categorized based on mutually exclusive classes and the number of occurrences in each respective class. It allows for a more structured and organized way to present raw data.

Common charts and graphs used in frequency distribution presentation and visualization include bar charts, histograms , pie charts, and line charts.

Measures of Central Tendency

Central tendency refers to a dataset’s descriptive summary using a single value reflecting the center of the data distribution. Measures of central tendency are also known as measures of central location. The mean, median , and mode are the measures of central tendency.

The mean, considered the most popular measure of central tendency, is the average value of a data set. The median refers to the middle score in a data set arranged in ascending order. The mode refers to the score or value that occurs most frequently in a data set.

Measures of Variability

A measure of variability is a summary statistic reflecting the degree of dispersion in a sample. The measures of variability determine how far apart the data points appear to fall from the center.

Dispersion, spread, and variability all refer to the range and width of the distribution of values in a data set. The range, standard deviation, and variance each depict a different aspect of the spread.

The range depicts the degree of dispersion, or the distance between the highest and lowest values, within a data set. The standard deviation measures the average distance between a value in the data set and the mean of that data set. The variance reflects the degree of spread and is essentially the average of the squared deviations from the mean.


Importance of Descriptive Statistics

Descriptive statistics make data easier to visualize. They allow data to be presented in a meaningful and understandable way, which, in turn, allows for a simplified interpretation of the data set in question. Raw data would be difficult to analyze; determining trends and patterns, or visualizing what the data are showing, would be challenging.

Consider the following example:

There are 100 students enrolled in a particular module. To understand the overall performance of the students taking the module and the distribution of their marks, descriptive statistics must be used. Working with the marks as raw data would make determining the overall performance and the distribution of the marks challenging.

Furthermore, descriptive statistics allow for a data set to be summarized and presented through a combination of tabulated and graphical descriptions and a discussion of the results found. Descriptive statistics are used to summarize complex quantitative data.


What Is Descriptive Statistics?

Descriptive statistics is a statistical measure used to describe data through numbers, like mean, median and mode. Here’s how to calculate them.

Satyapriya Chaudhari

In this article, I’ll help you understand the difference between descriptive statistics and inferential statistics. Then we’ll walk through some examples of descriptive statistics and how you can calculate them yourself.

What Is Statistics?

Statistics is the science of collecting and analyzing sample data to infer characteristics that are representative of the broader population. In other words, statistics involves interpreting data in order to make predictions about the population.

There are two branches of statistics.

  • Descriptive Statistics : Descriptive statistics is a statistical measure that describes data.
  • Inferential Statistics : You practice inferential statistics when you use a random sample of data taken from a population to describe and make inferences about the population.

Descriptive Statistics vs. Inferential Statistics

Descriptive statistics summarize data through certain numbers like mean, median, mode, etc. so as to make it easier to understand and interpret the data. Descriptive statistics don’t involve any generalization or inference beyond what is immediately available. This means that the descriptive statistics represent the available data (sample) and aren’t based on any theory of probability.

Commonly Used Measures

  • Measures of central tendency
  • Measures of dispersion (or variability)

What Are the Measures of Central Tendency?

A measure of central tendency is a one-number summary of the data that typically describes the center of the data. This one-number summary is of three types.

1. What Is the Mean?

Mean is the ratio of the sum of all observations in the data to the total number of observations. This is also known as average. Thus, mean is a number around which the entire data set is spread.


2. What Is the Median?  

Median is the point which divides the entire data into two equal halves. One half of the data is less than the median and the other half is greater than the median. Median is calculated by first arranging the data in either ascending or descending order.

  • If the number of observations is odd, median is given by the middle observation in the sorted form.
  • If the number of observations is even, the median is given by the mean of the two middle observations in the sorted form.

An important point to note is that the order of the data (ascending or descending) does not affect the median.

3. What Is the Mode? 

Mode is the number that has the maximum frequency in the entire data set. In other words, mode is the number that appears the most often. A data set can have one or more modes.

  • If there is only one number that appears the most number of times, the data has one mode, and is called uni-modal .
  • If there are two numbers that appear equally frequently, the data has two modes, and is called bi-modal .
  • If there are more than two numbers that appear equally frequently, the data has more than two modes. We call that multi-modal .

How to Find the Mean, Median and Mode

Consider the following data points:

17, 16, 21, 18, 15, 17, 21, 19, 11, 23

We calculate the mean as the sum of all observations divided by the number of observations: (17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23) / 10 = 178 / 10 = 17.8

To calculate the median, let’s arrange the data in ascending order:

11, 15, 16, 17, 17, 18, 19, 21, 21, 23

Since the number of observations is even (10), the median is given by the average of the two middle observations (the fifth and sixth here): (17 + 18) / 2 = 17.5

Mode is given by the number that occurs the most number of times. Here, 17 and 21 both occur twice. Hence, this is a bi-modal data set and the modes are 17 and 21.
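A quick way to verify these results is with Python's standard library (statistics.multimode requires Python 3.8 or later):

```python
# Mean, median and mode for the example data above.
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(statistics.mean(data))       # 17.8
print(statistics.median(data))     # 17.5 (average of 17 and 18)
print(statistics.multimode(data))  # [17, 21] -> bi-modal
```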

A few things to note: 

  • Since median and mode don’t consider all the data points for calculations, median and mode are robust against outliers (i.e., these are not affected by outliers).
  • At the same time, mean shifts toward the outlier as it considers all the data points. This means if the outlier is big, mean overestimates the data, and if it is small, the data is underestimated.
  • In a perfectly symmetrical, unimodal distribution, such as the normal distribution, mean = median = mode.


What Are Measures of Dispersion?

Measures of dispersion describe the spread of the data around the central value (or the measures of central tendency).

7 Measures of Dispersion

  • Absolute deviation from mean
  • Variance
  • Standard deviation
  • Range
  • Quartiles
  • Skewness
  • Kurtosis

1.  Absolute Deviation From Mean

The absolute deviation from the mean, also called mean absolute deviation (MAD), describes the variation in the data set. In a sense, it tells you the average absolute distance between each data point and the mean. We calculate it as: MAD = Σ|xᵢ − x̄| / n, where x̄ is the mean and n is the number of observations.

2. Variance

Variance measures how far data points spread out from the mean. A high variance indicates that data points are spread widely and a small variance indicates that the data points are closer to the data set's mean. We calculate it as: variance = Σ(xᵢ − x̄)² / n.

3. Standard Deviation

The square root of variance is called the standard deviation. We calculate it as: standard deviation = √(Σ(xᵢ − x̄)² / n) = √variance.


4. Range

Range is the difference between the maximum value and the minimum value in the data set: Range = Maximum value − Minimum value

5. Quartiles

Quartiles are the points in the data set that divides the data set into four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data set.

  • 25 percent of the data points lie below Q1 and 75 percent lie above it.
  • 50 percent of the data points lie below Q2 and 50 percent lie above it. Q2 is nothing but median.
  • 75 percent of the data points lie below Q3 and 25 percent lie above it.
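A sketch computing several of these dispersion measures for the same ten data points used earlier, with population (divide-by-n) formulas to match the definitions above:

```python
# Dispersion measures for the example data; pvariance/pstdev use the
# population (divide-by-n) formulas.
import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]
m = statistics.mean(data)

mad = sum(abs(x - m) for x in data) / len(data)  # mean absolute deviation: 2.6
var = statistics.pvariance(data)                 # variance: 10.76
sd = statistics.pstdev(data)                     # standard deviation: ~3.28
data_range = max(data) - min(data)               # range: 12
q1, q2, q3 = statistics.quantiles(data, n=4)     # quartiles; q2 is the median

print(mad, var, sd, data_range, (q1, q2, q3))
```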


6. Skewness

Skewness measures the asymmetry of a probability distribution. Skewness can be positive, negative or undefined. We'll focus on positive and negative skew.

  • Positive Skew: This is the case when the tail on the right side of the curve is bigger than that on the left side. For these distributions, the mean is greater than the mode.
  • Negative Skew: This is the case when the tail on the left side of the curve is bigger than that on the right side. For these distributions, mean is smaller than the mode.

The most commonly used method of calculating skewness is the third standardized moment: skewness = Σ(xᵢ − x̄)³ / (n · s³), where x̄ is the mean, s is the standard deviation and n is the number of observations.

If the skewness is zero, the distribution is symmetrical. If it is negative, the distribution is negatively skewed and if it is positive, it is positively skewed.

7. Kurtosis

Kurtosis describes whether the data is light tailed (lack of outliers) or heavy tailed (outliers present) when compared to a normal distribution. There are three kinds of kurtosis:

  • Mesokurtic: This is the case when the kurtosis is zero, similar to normal distributions.
  • Leptokurtic: This is when the tail of the distribution is heavy (outlier present) and kurtosis is higher than that of the normal distribution.
  • Platykurtic: This is when the tail of the distribution is light (no outlier) and kurtosis is lesser than that of the normal distribution.
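A sketch of both statistics via scipy, again on the ten example data points; note that scipy reports Fisher's (excess) kurtosis, so zero corresponds to a mesokurtic shape:

```python
# Skewness and excess kurtosis for the example data (illustrative).
from scipy.stats import kurtosis, skew

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(skew(data))      # about -0.38 -> mild negative skew
print(kurtosis(data))  # about -0.28 -> slightly platykurtic (light tails)
```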



What Is Descriptive Analytics? 5 Examples


09 Nov 2021

Data analytics is a valuable tool for businesses aiming to increase revenue, improve products, and retain customers. According to research by global management consulting firm McKinsey & Company, companies that use data analytics are 23 times more likely to outperform competitors in terms of new customer acquisition than non-data-driven companies. They were also nine times more likely to surpass them in measures of customer loyalty and 19 times more likely to achieve above-average profitability.

Data analytics can be broken into four key types :

  • Descriptive, which answers the question, “What happened?”
  • Diagnostic , which answers the question, “Why did this happen?”
  • Predictive , which answers the question, “What might happen in the future?”
  • Prescriptive , which answers the question, “What should we do next?”

Each type of data analysis can help you reach specific goals and be used in tandem to create a full picture of data that informs your organization’s strategy formulation and decision-making.

Descriptive analytics can be leveraged on its own or act as a foundation for the other three analytics types. If you’re new to the field of business analytics, descriptive analytics is an accessible and rewarding place to start.


What Is Descriptive Analytics?

Descriptive analytics is the process of using current and historical data to identify trends and relationships. It’s sometimes called the simplest form of data analysis because it describes trends and relationships but doesn’t dig deeper.

Descriptive analytics is relatively accessible and likely something your organization uses daily. Basic statistical software, such as Microsoft Excel or data visualization tools , such as Google Charts and Tableau, can help parse data, identify trends and relationships between variables, and visually display information.

Descriptive analytics is especially useful for communicating change over time and uses trends as a springboard for further analysis to drive decision-making .

Here are five examples of descriptive analytics in action to apply at your organization.


5 Examples of Descriptive Analytics

1. Traffic and Engagement Reports

One example of descriptive analytics is reporting. If your organization tracks engagement in the form of social media analytics or web traffic, you’re already using descriptive analytics.

These reports are created by taking raw data—generated when users interact with your website, advertisements, or social media content—and using it to compare current metrics to historical metrics and visualize trends.

For example, you may be responsible for reporting on which media channels drive the most traffic to the product page of your company’s website. Using descriptive analytics, you can analyze the page’s traffic data to determine the number of users from each source. You may decide to take it one step further and compare traffic source data to historical data from the same sources. This can enable you to update your team on movement; for instance, highlighting that traffic from paid advertisements increased 20 percent year over year.
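As a rough sketch of that year-over-year comparison (the traffic figures and source names below are invented for illustration), the calculation is just a percent change per source:

```python
# Hypothetical traffic counts by source for the same month, one year apart.
traffic_last_year = {"organic search": 41_000, "paid ads": 12_500, "social": 9_800}
traffic_this_year = {"organic search": 43_900, "paid ads": 15_000, "social": 9_100}

for source, current in traffic_this_year.items():
    previous = traffic_last_year[source]
    change = (current - previous) / previous * 100
    print(f"{source}: {current:,} visits ({change:+.0f}% year over year)")
# "paid ads" works out to +20%, the kind of movement described above
```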

The three other analytics types can then be used to determine why traffic from each source increased or decreased over time, if trends are predicted to continue, and what your team’s best course of action is moving forward.

2. Financial Statement Analysis

Another example of descriptive analytics that may be familiar to you is financial statement analysis. Financial statements are periodic reports that detail financial information about a business and, together, give a holistic view of a company’s financial health.

There are several types of financial statements, including the balance sheet, income statement, cash flow statement, and statement of shareholders’ equity. Each caters to a specific audience and conveys different information about a company’s finances.

Financial statement analysis can be done in three primary ways: vertical, horizontal, and ratio.

Vertical analysis involves reading a statement from top to bottom and comparing each item to those above and below it. This helps determine relationships between variables. For instance, if each line item is a percentage of the total, comparing them can provide insight into which are taking up larger and smaller percentages of the whole.

Horizontal analysis involves reading a statement from left to right and comparing each item to itself from a previous period. This type of analysis determines change over time.

Finally, ratio analysis involves comparing one section of a report to another based on their relationships to the whole. This directly compares items across periods, as well as your company’s ratios to the industry’s to gauge whether yours is over- or underperforming.

Each of these financial statement analysis methods is an example of descriptive analytics, as each provides information about trends and relationships between variables based on current and historical data.
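To make the vertical and horizontal methods concrete, here is a minimal sketch with invented income-statement figures; the numbers are placeholders, and only the arithmetic reflects how these analyses work:

```python
# Hypothetical income-statement line items, in dollars, for two years.
items_last_year = {"revenue": 500_000, "cost of goods sold": 300_000, "operating expenses": 120_000}
items_this_year = {"revenue": 560_000, "cost of goods sold": 322_000, "operating expenses": 130_000}

# Vertical analysis: each line item as a percentage of the statement's total (here, revenue).
for item, amount in items_this_year.items():
    print(f"{item}: {amount / items_this_year['revenue']:.0%} of revenue")

# Horizontal analysis: each line item compared with itself in the prior period.
for item, amount in items_this_year.items():
    growth = (amount - items_last_year[item]) / items_last_year[item]
    print(f"{item}: {growth:+.1%} versus last year")
```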


3. Demand Trends

Descriptive analytics can also be used to identify trends in customer preference and behavior and make assumptions about the demand for specific products or services.

Streaming provider Netflix’s trend identification provides an excellent use case for descriptive analytics. Netflix’s team—which has a track record of being heavily data-driven—gathers data on users’ in-platform behavior. They analyze this data to determine which TV series and movies are trending at any given time and list trending titles in a section of the platform’s home screen.

Not only does this data allow Netflix users to see what’s popular—and thus, what they might enjoy watching—but it allows the Netflix team to know which types of media, themes, and actors are especially favored at a certain time. This can drive decision-making about future original content creation, contracts with existing production companies, marketing, and retargeting campaigns.

4. Aggregated Survey Results

Descriptive analytics is also useful in market research. When it comes time to glean insights from survey and focus group data, descriptive analytics can help identify relationships between variables and trends.

For instance, you may conduct a survey and identify that as respondents’ age increases, so does their likelihood to purchase your product. If you’ve conducted this survey multiple times over several years, descriptive analytics can tell you if this age-purchase correlation has always existed or if it was something that only occurred this year.

Insights like this can pave the way for diagnostic analytics to explain why certain factors are correlated. You can then leverage predictive and prescriptive analytics to plan future product improvements or marketing campaigns based on those trends.


5. Progress to Goals

Finally, descriptive analytics can be applied to track progress to goals. Reporting on progress toward key performance indicators (KPIs) can help your team understand if efforts are on track or if adjustments need to be made.

For example, if your organization aims to reach 500,000 monthly unique page views, you can use traffic data to communicate how you’re tracking toward it. Perhaps halfway through the month, you’re at 200,000 unique page views. This would be underperforming because you’d like to be halfway to your goal at that point—at 250,000 unique page views. This descriptive analysis of your team’s progress can allow further analysis to examine what can be done differently to improve traffic numbers and get back on track to hit your KPI.
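The pacing arithmetic in this example is simple enough to script. A sketch, assuming a 30-day month and the figures above:

```python
# Mid-month check against a monthly page-view KPI.
goal = 500_000
days_in_month, days_elapsed = 30, 15
actual_so_far = 200_000

expected_so_far = goal * days_elapsed / days_in_month  # 250,000 at the halfway mark
gap = actual_so_far - expected_so_far                  # -50,000: behind pace
print(f"Expected by day {days_elapsed}: {expected_so_far:,.0f}")
print(f"Actual: {actual_so_far:,} ({gap:+,.0f} versus pace)")
```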


Using Data to Identify Relationships and Trends

“Never before has so much data about so many different things been collected and stored every second of every day,” says Harvard Business School Professor Jan Hammond in the online course Business Analytics . “In this world of big data, data literacy —the ability to analyze, interpret, and even question data—is an increasingly valuable skill.”

Leveraging descriptive analytics to communicate change based on current and historical data and as a foundation for diagnostic, predictive, and prescriptive analytics has the potential to take you and your organization far.

Do you want to become a data-driven professional? Explore our eight-week Business Analytics course and our three-course Credential of Readiness (CORe) program to deepen your analytical skills and apply them to real-world business problems.




Descriptive Statistics | Definitions, Types, Examples

Published on 4 November 2022 by Pritha Bhandari . Revised on 9 January 2023.

Descriptive statistics summarise and organise characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population .

In quantitative research , after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).

The next step is inferential statistics , which help you decide whether your data confirms or refutes your hypothesis and whether it is generalisable to a larger population.

Table of contents

  • Types of descriptive statistics
  • Frequency distribution
  • Measures of central tendency
  • Measures of variability
  • Univariate descriptive statistics
  • Bivariate descriptive statistics
  • Frequently asked questions

There are 3 main types of descriptive statistics:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability or dispersion concerns how spread out the values are.

Types of descriptive statistics

You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis.

For example, suppose you survey participants about how many times they did each of the following in the past year:

  • Go to a library
  • Watch a movie at a theater
  • Visit a national park

A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarise the frequency of every possible value of a variable in numbers or percentages.

Simple frequency distribution table

Gender Number
Male 182
Female 235
Other 27

From this table, you can see that more women than men or people with another gender identity took part in the study. In a grouped frequency distribution, you can group numerical response values and add up the number of responses for each group. You can also convert each of these numbers to percentages.

Grouped frequency distribution table

Library visits in the past year Percent
0–4 6%
5–8 20%
9–12 42%
13–16 24%
17+ 8%

Measures of central tendency estimate the center, or average, of a data set. The mean, median and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of our survey.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.

Mean number of library visits
Data set 15, 3, 12, 0, 24, 3
Sum of all values 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses N = 6
Mean Divide the sum of values by N to find M: 57/6 = 9.5

The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then, the median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits
Ordered data set 0, 3, 3, 12, 15, 24
Middle numbers 3, 12
Median Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

Mode number of library visits
Ordered data set 0, 3, 3, 12, 15, 24
Mode Find the most frequently occurring response: 3
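Software handles all three calculations directly. A minimal sketch using Python's built-in statistics module on the six responses above:

```python
import statistics

visits = [15, 3, 12, 0, 24, 3]  # the first 6 survey responses

print(statistics.mean(visits))    # 9.5
print(statistics.median(visits))  # 7.5, the mean of the two middle values 3 and 12
print(statistics.mode(visits))    # 3, the most frequent response
```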

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range, simply subtract the lowest value from the highest value. For the library visits data, the range is 24 – 0 = 24.

Standard deviation

The standard deviation ( s ) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  • List each score and find their mean.
  • Subtract the mean from each score to get the deviation from the mean.
  • Square each of these deviations.
  • Add up all of the squared deviations.
  • Divide the sum of the squared deviations by N – 1.
  • Find the square root of the number you found.
Raw data Deviation from mean Squared deviation
15 15 – 9.5 = 5.5 30.25
3 3 – 9.5 = -6.5 42.25
12 12 – 9.5 = 2.5 6.25
0 0 – 9.5 = -9.5 90.25
24 24 – 9.5 = 14.5 210.25
3 3 – 9.5 = -6.5 42.25
Mean = 9.5 Sum = 0 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18
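A sketch that follows the six steps literally for the same six responses, reproducing the working shown above:

```python
data = [15, 3, 12, 0, 24, 3]

mean = sum(data) / len(data)              # step 1: list the scores and find their mean (9.5)
deviations = [x - mean for x in data]     # step 2: subtract the mean from each score
squared = [d ** 2 for d in deviations]    # step 3: square each deviation
total = sum(squared)                      # step 4: add up the squared deviations (421.5)
variance = total / (len(data) - 1)        # step 5: divide by N - 1 (84.3)
std_dev = variance ** 0.5                 # step 6: take the square root (about 9.18)

print(variance, std_dev)
```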

The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

To find the variance, simply square the standard deviation. The symbol for variance is s^2.

Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

Visits to the library
N 6
Mean 9.5
Median 7.5
Mode 3
Standard deviation 9.18
Variance 84.3
Range 24

If you only consider the mean as a measure of central tendency, your impression of the ‘middle’ of the data set can be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to extreme values, you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests .

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read ‘across’ the table to see how the independent and dependent variables relate to each other.

Number of visits to the library in the past year
Group 0–4 5–8 9–12 13–16 17+
Children 32 68 37 23 22
Adults 36 48 43 83 25

Interpreting a contingency table is easier when the raw data is converted to percentages. Percentages make each row comparable to the other by making it seem as if each group had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable on the end.

Visits to the library in the past year (Percentages)
Group 0–4 5–8 9–12 13–16 17+ N
Children 18% 37% 20% 13% 12% 182
Adults 15% 20% 18% 35% 11% 235

From this table, it is clearer that similar proportions of children and adults go to the library over 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults, this number was between 13 and 16.
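A sketch of the percentage conversion described above, using the same counts (percentages are rounded, so a row may not sum to exactly 100):

```python
counts = {
    "Children": [32, 68, 37, 23, 22],
    "Adults":   [36, 48, 43, 83, 25],
}

for group, row in counts.items():
    n = sum(row)  # N for the group: 182 children, 235 adults
    percents = [round(100 * value / n) for value in row]
    print(group, percents, "N =", n)
# Children [18, 37, 20, 13, 12], Adults [15, 20, 18, 35, 11]
```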

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables. It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each data point is represented by a point in the chart.

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.

Figure: scatter plot of the relationship between movie theater visits and library visits.

Descriptive statistics summarise the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalisable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarise only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .

Cite this Scribbr article


Bhandari, P. (2023, January 09). Descriptive Statistics | Definitions, Types, Examples. Scribbr. Retrieved 12 August 2024, from https://www.scribbr.co.uk/stats/descriptive-statistics-explained/



2 Descriptive Statistics

Student Learning Outcomes

By the end of this chapter, the student should be able to:

  • Display data graphically and interpret graphs: stemplots, histograms and boxplots.
  • Recognize, describe, and calculate the measures of location of data: quartiles and percentiles.
  • Recognize, describe, and calculate the measures of the center of data: mean, median, and mode.
  • Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, and range.

Introduction

Once you have collected data, what will you do with it? Data can be described and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the house prices, so you might ask your real estate agent to give you a sample data set of prices.  Looking at all the prices in the sample often is overwhelming.  A better way might be to look  at the median price and the variation of prices. The median and variation are just two ways that you will learn to describe data. Your agent might also provide you with a graph of the data.

In this chapter, you will study numerical and graphical ways to describe and display your data.  This area of statistics is called “Descriptive Statistics” . You will learn to calculate, and even more importantly, to interpret these measurements and graphs.

Displaying Data

A statistical graph is a tool that helps you learn about the shape or distribution of a sample.  The graph can    be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values.  Newspapers and the Internet use graphs to show trends and  to enable readers to compare facts and figures quickly.

Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied.

Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar chart, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph), pie charts, and the boxplot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs and bar graphs. Our emphasis will be on histograms and boxplots.

Stem and Leaf Graphs (Stemplots), Line Graphs and Bar Graphs

One simple graph, the stem-and-leaf graph or stem plot, comes from the field of exploratory data analysis. It is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem and a leaf. The leaf consists of a final significant digit. For example, 23 has stem 2 and leaf 3. Four hundred thirty-two (432) has stem 43 and leaf 2. Five thousand four hundred thirty-two (5,432) has stem 543 and leaf 2. The decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

For Susan Dean’s spring pre-calculus class, scores for the first exam were as follows (smallest to largest):

33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100

Stem-and-Leaf Diagram

3 3
4 299
5 355
6 1378899
7 2348
8 03888
9 0244446
10 0

The stem plot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or approximately 26% of the scores were in the 90’s or 100, a fairly high number of As.

The stem plot is a quick way to graph and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may indicate that something unusual is happening. It takes some background information to explain outliers. In the example above, there were no outliers.
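For small data sets, a stem plot can also be built programmatically. A sketch for the exam scores above:

```python
from collections import defaultdict

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74,
          78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

stems = defaultdict(list)
for score in sorted(scores):
    stems[score // 10].append(score % 10)  # stem = leading digit(s), leaf = final digit

for stem in sorted(stems):
    print(stem, "".join(str(leaf) for leaf in stems[stem]))
```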

Another type of graph that is useful for specific data values is a line graph .  In the particular line graph shown in the example, the x-axis consists of data values and the y-axis consists of frequency points . The frequency points are connected.

In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his/her chores. The results are shown in the table (Table 2) and the line graph (Figure 1).

Number of times reminded Frequency
0 2
1 5
2 8
3 14
4 7
5 4

Figure 1. Line graph of the chore-reminder data, with the number of reminders per week on the x-axis and frequency on the y-axis.

Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be rectangular boxes and they can be vertical or horizontal.

The bar graph shown in Figure 2 has age groups represented on the x-axis and proportions on the y-axis .

By the end of 2011, in the United States, Facebook had over 146 million users. The table shows three age groups, the number of users in each age group, and the proportion (%) of users in each age group. Source: http://www.kenburbary.com/2011/03/facebook-demographics-revisited-2011-statistics-2/

Age group Number of Facebook users Proportion (%) of users
13 – 25 65,082,280 45%
26 – 44 53,300,200 36%
45 – 64 27,885,100 19%

Figure 2. Bar graph of Facebook users, with age groups on the x-axis and proportions on the y-axis.

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set consists of 100 values or more.

A histogram consists of contiguous boxes. It has both a horizontal axis and a vertical axis.  The horizontal  axis is labeled with what the data represents (for instance, distance from your home to school). The vertical axis is labeled either Frequency or relative frequency .  The graph will have the same shape with either  label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the data. (The next section tells you how to calculate the center and the spread.)

The relative frequency is equal to the frequency for an observed value of the data divided by the total number of data values in the sample. (In the chapter on Sampling and Data, we defined frequency as the number of times an answer occurs.) If:

•    f = frequency

•    n = total number of data values (or the sum of the individual frequencies), and

•    RF = relative frequency,

RF = \frac{f}{n}

If 3 students in Mr. Ahab’s English class of 40 students received from 90% to 100%, then,

RF = \frac{3}{40} = 0.075

Seven and a half percent of the students received scores from 90% to 100%. Ninety percent to 100% are quantitative measures.

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of from 5 to 15 bars or classes for clarity. Choose a starting point for the first interval to be less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 – 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 – 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 – 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 – 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary.

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. The heights are continuous data since height is measured.

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67;

67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance,  61.5),  we want our starting point to have two decimal places.  Since the numbers 0.5,  0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for  the convenient starting point.

60 – 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.

The largest value is 74. 74+ 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the ending value and divide by the number of bars (you must choose the number of     bars you desire). Suppose you choose 8 bars.

\frac{74.05-59.95}{8}=1.76

NOTE: We will round up to 2 and make each bar or class interval 2 units wide. Rounding up to 2 is one way to prevent a value from falling on a boundary. Rounding to the next number is necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the width would also work.

The boundaries are:

•    59.95

•    59.95 + 2 = 61.95

•    61.95 + 2 = 63.95

•    63.95 + 2 = 65.95

•    65.95 + 2 = 67.95

•    67.95 + 2 = 69.95

•    69.95 + 2 = 71.95

•    71.95 + 2 = 73.95

•    73.95 + 2 = 75.95

The heights 60 through 61.5 inches are in the interval 59.95 – 61.95. The heights that are 63.5 are in the interval 61.95 – 63.95. The heights that are 64 through 64.5 are in the interval 63.95 – 65.95. The heights 66 through 67.5 are in the interval 65.95 – 67.95. The heights 68 through 69.5 are in the interval 67.95 – 69.95. The heights 70 through 71 are in the interval 69.95 – 71.95. The heights 72 through 73.5 are in the interval 71.95 – 73.95. The height 74 is in the interval 73.95 – 75.95.

The following histogram ( Figure 3 ) displays the heights on the x-axis and relative frequency on the y-axis.

Figure 3. Histogram of the soccer players’ heights, with height intervals on the x-axis and relative frequency on the y-axis.

The following data are the number of books bought by 50 part-time college students at  ABC College. The number of books is discrete data since books are counted.

1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1

2; 2; 2; 2; 2; 2; 2; 2; 2; 2

3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3

4; 4; 4; 4; 4; 4

5; 5; 5; 5; 5

6; 6

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students buy 3 books. Six students buy 4 books. Five students buy 5 books. Two students buy 6 books.

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and the ending value is 6.5.

Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many different values, a width that places the data values in the middle of the bar or class interval is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 and the starting point is 0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from 3.5 to 4.5, the 5 in the middle of the interval from 4.5 to 5.5, and the 6 in the middle of the interval from 5.5 to 6.5.

Calculate the number of bars as follows:

\frac{6.5-0.5}{\text{number of bars}}=1

where 1 is the width of a bar. Therefore, the number of bars is 6.

The following histogram ( Figure 4 ) displays the number of books on the x-axis and the frequency on the y-axis.

Figure 4. Histogram of the number of books bought, with the number of books on the x-axis and frequency on the y-axis.
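A sketch of the same binning with NumPy: the edges start at 0.5 and are one unit wide, so each count lands in the middle of its bar.

```python
import numpy as np

# Reconstruct the 50 responses from the counts given above.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

edges = np.arange(0.5, 7.5, 1)              # 0.5, 1.5, ..., 6.5
frequencies, _ = np.histogram(books, bins=edges)
print(frequencies)                          # [11 10 16  6  5  2]
```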

Measures of the Location of the Data

The common measures of location are quartiles and percentiles.

Quartiles are special percentiles. The first quartile, Q 1, is the same as the 25th percentile, and the third quartile, Q 3, is the same as the 75th percentile. The median, M , is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest.  Quartiles  divide  ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles  are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are     less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the “center” of the data. You  can think of the median as the “middle value,” but     it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data.

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1

Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

\frac{6.8+7.2}{2}=7

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q 1, is the middle value of the lower half of the data, and the third quartile, Q 3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.

1; 1; 2; 2; 4; 6; 6.8

The number two, which is part of the data, is the first quartile . One-fourth of the entire sets of values are the same as or less than two and three-fourths of the values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile , Q 3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile ( Q 3) and the first quartile ( Q 1).

IQR = Q 3 – Q 1

The IQR can help to determine potential outliers . A value is suspected to be a potential outlier if it is less than (1.5)( IQR ) below the first quartile or more than (1.5)( IQR ) above the third quartile . Potential outliers always require further investigation.

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.

389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000;

488,800; 1,095,000

Order the data from smallest to largest.

114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000;

1,095,000; 5,500,000

M = 488,800

Q 1 = \frac{230,500 + 387,000}{2} = 308,750

Q 3 = \frac{639,000 + 659,000}{2} = 649,000

IQR = 649,000 – 308,750 = 340,250

(1.5)( IQR ) = (1.5)(340,250) = 510,375

Q 1 – (1.5)( IQR ) = 308,750 – 510,375 = –201,625

Q 3 + (1.5)( IQR ) = 649,000 + 510,375 = 1,159,375

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier .
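A sketch of the fence calculation above. It uses this chapter's quartile convention (the median of each half of the ordered data), which can differ slightly from the interpolation that many software packages use by default:

```python
prices = sorted([389_950, 230_500, 158_000, 479_000, 639_000, 114_950,
                 5_500_000, 387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000])

def median(values):
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

q1 = median(prices[:len(prices) // 2])        # 308,750 (lower half excludes the overall median)
q3 = median(prices[(len(prices) + 1) // 2:])  # 649,000
iqr = q3 - q1                                 # 340,250

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([p for p in prices if p < lower or p > upper])  # [5500000]
```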

For the two data sets in the test scores example , find the following:

a.     The interquartile range. Compare the two interquartile ranges.

b.     Any outliers in either set.

Solution –  Example 7

The five-number summary (minimum, Q 1, median, Q 3, maximum) for the day and night classes is:

Day class: 32, 56, 74.5, 82.5, 99
Night class: 25.5, 78, 81, 89, 98

a. The IQR for the day group is Q 3 – Q 1 = 82.5 – 56 = 26.5. The IQR for the night group is Q 3 – Q 1 = 89 – 78 = 11.

The interquartile range (the spread or variability) for the day class is larger than the night class IQR . This suggests more variation will be found in the day class’s class test scores.

b.      Day class outliers are found using the IQR times 1.5 rule. So,

Q 1 – IQR (1.5) = 56 – 26.5(1.5) = 16.25

Q 3 + IQR (1.5) = 82.5 + 26.5(1.5) = 122.25

Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.

Night class outliers are calculated as:

Q 1 – IQR (1.5) = 78 – 11(1.5) = 61.5

Q 3 + IQR(1.5) = 89 + 11(1.5) = 105.5

For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:

 

Hours of sleep Frequency Relative frequency Cumulative relative frequency
4 2 0.04 0.04
5 5 0.10 0.14
6 7 0.14 0.28
7 12 0.24 0.52
8 14 0.28 0.80
9 7 0.14 0.94
10 3 0.06 1.00

Find the 28th percentile. Notice the 0.28 in the “cumulative relative frequency” column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median . Look again at the “cumulative relative frequency” column and find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can “eyeball” this answer. If you look at the “cumulative relative frequency” column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q 3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)
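A sketch that rebuilds the cumulative relative frequency column used in these three questions:

```python
hours = [4, 5, 6, 7, 8, 9, 10]
freq = [2, 5, 7, 12, 14, 7, 3]            # 50 students in total
n = sum(freq)

cumulative = 0
for h, f in zip(hours, freq):
    cumulative += f
    print(h, f, f / n, cumulative / n)    # value, relative freq, cumulative relative freq
```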

When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information:

  • information about the context of the  situation  being  considered,
  • the data value (value of the variable) that represents the percentile,
  • the percent of individuals or items with data values below the percentile.
  • Additionally, you may also choose to state the percent of individuals or items with data values above the percentile.

On a timed math test, the first quartile for times for finishing the exam was 35 minutes. Interpret the first quartile in the context of this situation.

25% of students finished the exam in 35 minutes or less. 75% of students finished the exam in 35 minutes or more.

A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret  the 70th percentile in the context of this situation.

70% of students answered 16 or fewer questions correctly. 30% of students answered 16 or more questions correctly.

Note: A high percentile could be considered good, as answering more questions correctly is desirable.

At a certain community college, it was found that the 30th percentile of credit units that students  are enrolled for is 7 units. Interpret the 30th percentile in the context of this situation.

30% of students are enrolled in 7 or fewer credit units. 70% of students are enrolled in 7 or more credit units.

In this example, there is no “good” or “bad” value judgment associated  with  a  higher  or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.

Measures of the Center of the Data

The “center” of a data set is also a way of describing location. The two most widely used measures of the “center” of the data are the mean (average) and the median . To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts (previously discussed under box plots in this chapter). The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.

NOTE: The words “mean” and “average” are often used interchangeably. The substitution of one word for the other is common practice. The technical term is “arithmetic mean” and “average” is technically a center location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic mean.”

The mean can also be calculated by multiplying each distinct value by its frequency and then dividing the sum by the total number of data values. The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): \bar{x}.

The Greek letter µ (pronounced “mew”) represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.

To see that both ways of calculating the mean are the same, consider the sample:  1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4

\bar{x}=\frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11}=2.7

\bar{x}=\frac{3(1) + 2(2) + 1(3) + 5(4)}{11}=2.7

Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?

\bar{x}=\frac{5,000,000+49(30,000)}{50}=129,400

(There are 49 people who earn $30,000 and one person who earns $5,000,000. The median is $30,000.)

The median is a better measure of the “center” than the mean because 49 of the values are $30,000 and one is $5,000,000. The $5,000,000 is an outlier. The $30,000 gives us a better sense of the middle of the data.

Box plots or box-whisker plots give a good graphical image of the concentration of the data.  They also  show how far from most of the data the extreme values are.  The box plot is constructed from five values:    the smallest value, the first quartile, the median, the third quartile, and the largest value.  The median, the  first quartile, and the third quartile will be discussed here, and then again in the section on measuring data   in this chapter. We use these values to compare how close other data values are to them.

The median , a number, is a way of measuring the “center” of the data. You can think of the median as the “middle value,” although it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median and half the values are the same number or larger. For example, consider the following data:

1; 1; 2; 2; 4; 6; 6.8 ; 7.2 ; 8; 8.3; 9; 10; 10; 11.5

The median is between the 7th value, 6.8, and the 8th value, 7.2. To find the median, add the two values together and divide by 2.

\frac{6.8+7.2}{2}=7

The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7.

Quartiles are numbers that separate the data into quarters.  Quartiles may or may not be part of the data.     To find the quartiles, first find the median or second quartile. The first quartile is the middle value of the lower half of the data and the third quartile is the middle value of the upper half of the data.  To  get the      idea, consider the same data set shown above:

The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is 2.

1; 1; 2; 2 ; 4; 6; 6.8

The number 2, which is part of the data, is the first quartile . One-fourth of the values are the same or less than 2 and three-fourths of the values are more than 2.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is 9.

7.2; 8; 8.3; 9 ; 10; 10; 11.5

The number 9,  which is part of the data,  is the third quartile .  Three-fourths of the values are less than 9  and one-fourth of the values are more than 9.

To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest data values label the endpoints of the axis.  The first quartile marks one end of the box and the third quartile  marks the other end of the box. The middle fifty percent of the data fall inside the box. The “whiskers” extend from the ends of the box to the smallest and largest data values. The box plot gives a good quick picture of the data.

NOTE: You may encounter box and whisker plots that have dots marking outlier values. In those cases, the whiskers are not extending to the minimum and maximum values.

Consider the following data:

1; 1; 2; 2; 4; 6; 6.8 ; 7.2; 8; 8.3; 9; 10; 10; 11.5

The first quartile is 2, the median is 7,  and the third quartile is 9.  The smallest value is 1 and the largest  value is 11.5.  The box plot is constructed as follows:

Box plot of the data: smallest value 1, first quartile 2, median 7, third quartile 9, largest value 11.5.

The two whiskers extend from the first quartile to the smallest value and from the third quartile to the  largest value. The median is shown with a dashed line.
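A sketch that draws this box plot with matplotlib. The whis setting extends the whiskers to the minimum and maximum, as in this chapter; note that matplotlib computes quartiles by interpolation, so the box edges may sit slightly off the hand-calculated values:

```python
import matplotlib.pyplot as plt

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]

plt.boxplot(data, vert=False, whis=(0, 100))  # whiskers at the 0th and 100th percentiles
plt.xlabel("Value")
plt.show()
```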

The following data are the heights of 40 students in a statistics class.

59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69; 70; 70; 70;

70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77

a.    Each quarter has 25% of the data.

b.    The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (3rd quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.

c.   Interquartile Range: IQR = Q 3 – Q 1 = 70 – 64.5 = 5.5.

d.    The interval 59 through 65 has more than 25% of the data so it has more data in it than the interval 66 through 70 which has 25% of the data.

e.    The middle 50% (middle half) of the data has a range of 5.5 inches.

For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both 1, the median and the third quartile were both 5, and the largest value was 7, the box plot would look as follows:

Box plot with the smallest value and first quartile both at 1, the median and third quartile both at 5, and the largest value at 7.

The Law of Large Numbers and the Mean

The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean \bar{x} of the samples tends to get closer and closer to the population mean \mu; that is, the difference \mu-\bar{x} tends toward zero as the sample size increases.

This concept is so important and plays such a critical role in what follows it deserves to be developed further. Indeed, there are two critical issues that flow from the Central Limit Theorem and the application of the Law of Large numbers to it. These are

1. The probability density function of the sampling distribution of means is normally distributed regardless of the underlying distribution of the population observations, and

2. The standard deviation of the sampling distribution decreases as the size of the samples that were used to calculate the means for the sampling distribution increases.

Taking these in order: it would seem counterintuitive that the population may have any distribution while the distribution of means coming from it would be normally distributed. With the use of computers, experiments can be simulated that show the process by which the sampling distribution changes as the sample size is increased. These simulations show visually the results of the mathematical proof of the Central Limit Theorem.

Here are three examples of very different population distributions and the evolution of the sampling distribution to a normal distribution as the sample size increases. The top panel in these cases represents the histogram for the original data. The three panels show the histograms for 1,000 randomly drawn samples for different sample sizes: n = 10, n = 25, and n = 50. As the sample size increases, and the number of samples taken remains constant, the distribution of the 1,000 sample means becomes closer to the smooth line that represents the normal distribution.

Sampling Distributions and Statistic of a Sampling Distribution

You can think of a sampling distribution as a relative frequency distribution with a great many samples. (See Sampling and Data for a review of relative frequency). Suppose thirty randomly selected students  were asked the number of movies they watched the previous week. The results are in the relative frequency table shown below.

Number of movies Relative frequency
0 5/30
1 15/30
2 6/30
3 3/30
4 1/30

If you let the number of samples get very large (say, 300 million or more), the relative frequency table becomes a relative frequency distribution.

Skewness and the Mean, Median, and Mode

Consider the following data set:

4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10

This data set produces the histogram shown below. Each interval has width one and each value is located in the middle of an interval.

Histogram of the symmetrical data set, with each value centered in an interval of width one.

The histogram displays a symmetrical distribution of data.  A distribution is symmetrical if a vertical line   can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each 7 for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal) and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.

The histogram for the data

4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 8

is not symmetrical. The right-hand side seems “chopped off” compared to the left side. The distribution is called skewed to the left because it is pulled out to the left.

Histogram of the data set that is skewed to the left.

The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more.

The histogram for the data

6 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10

is also not symmetrical. It is skewed to the right.

Histogram of the data set that is skewed to the right.

The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, the mean is the largest, while the mode is the smallest . Again, the mean reflects the skewing the most.

To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right,  the mode is often less  than the median, which is less than the mean.
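A sketch that checks these relationships for the three small data sets above:

```python
import statistics

symmetric = [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10]
left_skewed = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8]
right_skewed = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]

for name, data in [("symmetric", symmetric), ("left", left_skewed), ("right", right_skewed)]:
    print(name, statistics.mean(data), statistics.median(data), statistics.mode(data))
# symmetric: 7, 7, 7    left-skewed: 6.3, 6.5, 7    right-skewed: 7.7, 7.5, 7
```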

Skewness and symmetry become important when we discuss probability distributions in later chapters.

Measures of the Spread of the Data

An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation.

The standard deviation is a number that measures how far data values are from their mean.

The standard deviation

•    provides a numerical measure of the overall amount of variation in a data set

•    can be used to determine whether a particular data value is close to or far from the mean

The standard deviation provides a measure of the overall variation in a data set

The standard deviation is always positive or 0. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

Suppose that we are studying waiting times at the checkout line for customers at supermarket A and supermarket B; the average wait time at both markets is 5 minutes.  At market A, the standard deviation      for the waiting time is 2 minutes; at market B the standard deviation for the waiting time is 4 minutes.

Because market B has a higher standard  deviation,  we  know  that  there  is  more  variation  in  the  waiting times at market B. Overall, wait times at market B are more spread out from the average; wait times at market A are more concentrated near the average.

The standard deviation can be used to determine whether a data value is close to or far from the mean. Suppose that Rosa and Binh both shop at market A, where the mean wait time is 5 minutes and the standard deviation is 2 minutes. Rosa waits for 7 minutes and Binh waits for 1 minute at the checkout counter.

Rosa waits for 7 minutes:

  • 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation.
  • Rosa’s wait time of 7 minutes is 2 minutes longer than the average of 5 minutes.
  • Rosa’s wait time of 7 minutes is one standard deviation above the average of 5 minutes.

Binh waits for 1 minute.

  • 1 is 4 minutes less than the average of 5;
  • 4 minutes is equal to two standard deviations.
  • Binh’s wait time of 1 minute is 4 minutes less than the average of 5 minutes.
  • Binh’s wait time of 1 minute is two standard deviations below the average of 5 minutes.

A data value that is two standard deviations from the average is just on the borderline for what many statisticians would consider to be far from the average. Considering data to be far from the mean if it is more than 2 standard deviations away is more of an approximate “rule of thumb” than a rigid rule. In general, the shape of the distribution of the data affects how much of the data is further away than 2 standard deviations. (We will learn more about this in later chapters.)

If 1 were also part of the data set, then 1 is two standard deviations to the left of 5 because

5+(-2)(2) = 1

  • In general, a value = mean + (#ofSTDEVs)(standard deviation)
  • where #ofSTDEVs = the number of standard deviations
  • 7 is one standard deviation more than the mean of 5 because 7 = 5 + (1)(2)
  • 1 is two standard deviations less than the mean of 5 because 1 = 5 + (–2)(2)

The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a population:

For a sample: x = \bar{x} + (#\;of\;STDEVs)(s)

For a population: x = \mu + (#\;of\;STDEVs)(\sigma)

Calculating the Standard Deviation

If x is a number, then the difference “x minus the mean” is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is x-\mu; for sample data, in symbols a deviation is x-\bar{x}.

If the numbers come from a census of the entire population and not a sample, when we calculate the aver- age of the squared deviations to find the variance, we divide by N , the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by n-1 , one less than the number of items in the sample. You can see that in the formulas below.

Formulas for the Sample Standard Deviation

s=\sqrt{\frac{\sum\left(x-\bar{x}\right)^2}{n-1}}

For the sample standard deviation, the denominator is n-1 , that is the sample size MINUS 1.

Formulas for the Population Standard Deviation

\sigma=\sqrt{\frac{\sum\left(x-\mu\right)^2}{N}}

For the population standard deviation, the denominator is N , the number of items in the population.

In versions of these formulas adapted for frequency tables, f represents the frequency with which a value appears. For example, if a value appears once, f is 1. If a value appears three times in the data set or population, f is 3. The frequency-based formulas are s=\sqrt{\frac{\sum f\left(x-\bar{x}\right)^2}{n-1}} for a sample and \sigma=\sqrt{\frac{\sum f\left(x-\mu\right)^2}{N}} for a population.

Sampling Variability of a Statistic

How much a statistic varies from one sample to another is known as the sampling variability of a statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of the mean is \frac{\sigma}{\sqrt{n}}.

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year:

9 ; 9.5 ; 9.5 ; 10 ; 10 ; 10 ; 10 ; 10.5 ; 10.5 ; 10.5 ; 10.5 ; 11 ; 11 ; 11 ; 11 ; 11 ; 11 ; 11.5 ; 11.5 ; 11.5

\bar{x}=\frac{9+9.5(2)+10(4)+10.5(4)+11(6)+11.5(3)}{20}=10.525

The average age is 10.53 years, rounded to 2 places.

The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square root of the variance. We will explain the parts of the table after calculating s .

Data (x) Frequency (f) Deviation (x – \bar{x}) f × (deviation)²
9 1 9 – 10.525 = –1.525 1 × 2.325625 = 2.325625
9.5 2 9.5 – 10.525 = –1.025 2 × 1.050625 = 2.101250
10 4 10 – 10.525 = –0.525 4 × 0.275625 = 1.102500
10.5 4 10.5 – 10.525 = –0.025 4 × 0.000625 = 0.002500
11 6 11 – 10.525 = 0.475 6 × 0.225625 = 1.353750
11.5 3 11.5 – 10.525 = 0.975 3 × 0.950625 = 2.851875

The sample variance, s 2, is equal to the sum of the last column (9.7375) divided by the total number of data values minus one (20 – 1):

s^2= \frac{9.7375}{20-1}=0.5125

The sample standard deviation s is equal to the square root of the sample variance:

s=\sqrt{0.5125}=0.715891

Rounded to two decimal places, s = 0.72

Typically, you do the calculation for the standard deviation on your calculator or computer . The intermediate results are not rounded. This is done for accuracy.
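A sketch of that calculation for the ages sample; the last line checks the hand computation against Python's built-in statistics module:

```python
import statistics

ages = [9] + [9.5] * 2 + [10] * 4 + [10.5] * 4 + [11] * 6 + [11.5] * 3  # n = 20

mean = sum(ages) / len(ages)                 # 10.525
ss = sum((x - mean) ** 2 for x in ages)      # 9.7375, the sum of the last column
sample_variance = ss / (len(ages) - 1)       # 0.5125
print(sample_variance ** 0.5)                # 0.7158..., rounds to 0.72
print(statistics.stdev(ages))                # same value
```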

Find the value that is one standard deviation above the mean, \bar{x} + 1s.

Solution – Example 15

\bar{x} + 1s = 10.53 + (1)(0.72) = 11.25

Find the value that is two standard deviations below the mean, \bar{x} - 2s.

Solution – Example 16

\bar{x} - 2s = 10.53 - (2)(0.72) = 9.09

Find the values that are 1.5 standard deviations from (below and above) the mean.

\bar{x} - 1.5s = 10.53 - (1.5)(0.72) = 9.45 and \bar{x} + 1.5s = 10.53 + (1.5)(0.72) = 11.61

Explanation of the standard deviation calculation shown in the table

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when the data value is greater than the mean. A negative deviation occurs when the data value is less than the mean; the deviation is –1.525 for the data value 9. If you add the deviations, the sum is always zero. (For this example, there are n = 20 deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you make them positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.

The variance is a squared measure and does not have the same units as the data. Taking the square root  solves the problem. The standard deviation measures the spread in the same units as the data.

Notice that instead of dividing by n = 20, the calculation divided by n – 1 = 20 – 1 = 19 because the data is a sample. For the sample variance, we divide by the sample size minus one (n – 1). Why not divide by n? The answer has to do with the population variance. The sample variance is an estimate of the population variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (n – 1) gives a better estimate of the population variance.

NOTE: Your concentration should be on what the standard deviation tells us about the data. The standard deviation is a number which measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.

The standard deviation, s or σ, is either zero or larger than zero. When the standard deviation is 0, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all concentrated close to the mean, and larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make s or σ very large.

The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better “feel” for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be confusing, always graph your data .


Quantitative Analysis for Business Copyright © by Margo Bergman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Research Methods Knowledge Base


Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics . With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what’s going on in our data.

Descriptive Statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is simply the number of hits divided by the number of times at bat (reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in four. The single number describes a large number of discrete events. Or, consider the scourge of many students, the Grade Point Average (GPA). This single number describes the general performance of a student across a potentially wide range of course experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing important detail. The batting average doesn’t tell you whether the batter is hitting home runs or singles. It doesn’t tell whether she’s been in a slump or on a streak. The GPA doesn’t tell you whether the student was in difficult courses or easy ones, or whether they were courses in their major field or in other disciplines. Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

  • the distribution
  • the central tendency
  • the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in our study.

The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value. For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value. But what do we do for a variable like income or GPA? With these variables there can be a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.

Category             Percent
Under 35 years old   9%
36–45                21%
46–55                45%
56–65                19%
66+                  6%

One of the most common ways to describe a single variable is with a frequency distribution. Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first (e.g. with age, price, or temperature variables, it would usually not be sensible to determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies determined). Frequency distributions can be depicted in two ways, as a table or as a graph. The table above shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph; this type of graph is often referred to as a histogram or bar chart.
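As a quick illustration of building such a table in code, here is a minimal sketch using Python's collections.Counter; the year-in-college data are made up for the example:

```python
from collections import Counter

# Hypothetical sample: year in college for 12 students
years = ["Freshman", "Sophomore", "Freshman", "Senior", "Junior", "Sophomore",
         "Freshman", "Junior", "Senior", "Freshman", "Sophomore", "Junior"]

freq = Counter(years)
for year, count in freq.most_common():
    print(f"{year:<10} {count:>2}  ({100 * count / len(years):.0f}%)")
```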

Distributions may also be displayed using percentages. For example, you could use percentages to describe the:

  • percentage of people in different income levels
  • percentage of people in different age ranges
  • percentage of people in different ranges of standardized test scores

Central Tendency

The central tendency of a distribution is an estimate of the “center” of a distribution of values. There are three major types of estimates of central tendency:

The Mean or average is probably the most commonly used method of describing central tendency. To compute the mean all you do is add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam. Consider, for example, the test score values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

The Median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample. For example, if there are 499 scores in the list, score #250 would be the median. If we order the 8 scores shown above, we would get:

15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median.

The Mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values ( 20.875 , 20 , and 15 ) for the mean, median and mode respectively. If the distribution is truly normal (i.e. bell-shaped), the mean, median and mode are all equal to each other.

Dispersion

Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 − 15 = 21.

The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The Standard Deviation shows the relation that set of scores has to the mean of the sample. Again, let's take the set of scores:

15, 20, 21, 20, 36, 15, 25, 15

To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 20.875. So, the differences from the mean are:

−5.875, −0.875, 0.125, −0.875, 15.125, −5.875, 4.125, −5.875

Notice that values that are below the mean have negative discrepancies and values above it have positive ones. Next, we square each discrepancy:

34.515625, 0.765625, 0.015625, 0.765625, 228.765625, 34.515625, 17.015625, 34.515625

Now, we take these “squares” and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This value is known as the variance. To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier). This would be SQRT(50.125) = 7.079901129253.

Although this computation may seem convoluted, it’s actually quite simple. To see this, consider the formula for the standard deviation:

s = \sqrt{\frac{\Sigma(X - \bar{X})^2}{n - 1}}

where:

  • X is each score,
  • X̄ is the mean (or average),
  • n is the number of values,
  • Σ means we sum across the values.

In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1 . The ratio is the variance and the square root is the standard deviation. In English, we can describe the standard deviation as:

the square root of the sum of the squared deviations from the mean divided by the number of scores minus one.
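That sentence translates almost line for line into code. A from-scratch sketch using the eight scores above:

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = sum(scores) / len(scores)              # 20.875
ss = sum((x - mean) ** 2 for x in scores)     # Sum of Squares: 350.875
variance = ss / (len(scores) - 1)             # 350.875 / 7 = 50.125
sd = math.sqrt(variance)                      # 7.0799...

print(mean, ss, variance, sd)
```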

Although we can calculate these univariate statistics by hand, it gets quite tedious when you have more than a few values and variables. Every statistics program is capable of calculating them easily for you. For instance, I put the eight scores into SPSS and got the following table as a result:

Metric               Value
N                    8
Mean                 20.8750
Median               20.0000
Mode                 15.00
Standard Deviation   7.0799
Variance             50.1250
Range                21.00

which confirms the calculations I did by hand above.

The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:

  • approximately 68% of the scores in the sample fall within one standard deviation of the mean
  • approximately 95% of the scores in the sample fall within two standard deviations of the mean
  • approximately 99% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799 , we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348 . This kind of information is a critical stepping stone to enabling us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales.
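The same interval can be computed, and checked against the data, in a few lines:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean, sd = statistics.mean(scores), statistics.stdev(scores)

low, high = mean - 2 * sd, mean + 2 * sd
print(round(low, 4), round(high, 4))      # 6.7152 35.0348

within = [x for x in scores if low <= x <= high]
print(len(within), "of", len(scores), "scores fall inside the interval")
```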


Statology

Descriptive vs. Inferential Statistics: What’s the Difference?

There are two main branches in the field of statistics:

  • Descriptive Statistics
  • Inferential Statistics

This tutorial explains the difference between the two branches and why each one is useful in certain situations.

Descriptive Statistics

In a nutshell,  descriptive statistics  aims to  describe  a chunk of raw data using summary statistics, graphs, and tables.

Descriptive statistics are useful because they allow you to understand a group of data much more quickly and easily compared to just staring at rows and rows of raw data values.

For example, suppose we have a set of raw data that shows the test scores of 1,000 students at a particular school. We might be interested in the average test score along with the distribution of test scores.

Using descriptive statistics, we could find the average score and create a graph that helps us visualize the distribution of scores.

This allows us to understand the test scores of the students much more easily compared to just staring at the raw data.

Common Forms of Descriptive Statistics

There are three common forms of descriptive statistics:

1. Summary statistics.  These are statistics that  summarize  the data using a single number. There are two popular types of summary statistics:

  • Measures of central tendency : these numbers describe where the center of a dataset is located. Examples include the  mean   and the  median .
  • Measures of dispersion : these numbers describe how spread out the values are in the dataset. Examples include the  range ,  interquartile range ,  standard deviation , and  variance .

2. Graphs . Graphs help us visualize data. Common types of graphs used to visualize data include boxplots , histograms , stem-and-leaf plots , and scatterplots .

3. Tables . Tables can help us understand how data is distributed. One common type of table is a  frequency table , which tells us how many data values fall within certain ranges. 
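For a numeric variable, the frequency table is usually built over ranges of values rather than individual values. A minimal sketch that bins a handful of hypothetical scores into ten-point ranges:

```python
from collections import Counter

scores = [82, 91, 67, 88, 74, 95, 81, 79, 85, 70]   # hypothetical raw scores

# Group each score into a ten-point range (e.g. 82 -> the 80-89 bin)
bins = Counter((s // 10) * 10 for s in scores)

for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {bins[lower]} score(s)")
```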

Example of Using Descriptive Statistics

The following example illustrates how we might use descriptive statistics in the real world.

Suppose 1,000 students at a certain school all take the same test. We are interested in understanding the distribution of test scores, so we use the following descriptive statistics:

1. Summary Statistics

Mean: 82.13 . This tells us that the average test score among all 1,000 students is 82.13.

Median: 84.  This tells us that half of all students scored higher than 84 and half scored lower than 84.

Max: 100. Min: 45.  This tells us the maximum score that any student obtained was 100 and the minimum score was 45. The  range – which tells us the difference between the max and the min – is 55.

To visualize the distribution of test scores, we can create a histogram – a type of chart that uses rectangular bars to represent frequencies.

[Figure: histogram of the 1,000 test scores]

Based on this histogram, we can see that the distribution of test scores is roughly bell-shaped. Most of the students scored between 70 and 90, while very few scored above 95 and fewer still scored below 50.

Another easy way to gain an understanding of the distribution of scores is to create a frequency table. For example, the following frequency table shows what percentage of students scored between various ranges:

[Figure: frequency table showing the percentage of students in each score range]

We can see that just 4% of the total students scored above a 95. We can also see that (12% + 9% + 4% = ) 25% of all students scored an 85 or higher.

A frequency table is particularly helpful if we want to know what percentage of the data values fall above or below a certain value. For example, suppose the school considers an “acceptable” test score to be any score above a 75.

By looking at the frequency table, we can easily see that (20% + 22% + 12% + 9% + 4% = ) 67% of the students received an acceptable test score.

Inferential Statistics

In a nutshell, inferential statistics uses a small sample of data to draw inferences about the larger population that the sample came from.

For example, we might be interested in understanding the political preferences of millions of people in a country.

However, it would take too long and be too expensive to actually survey every individual in the country. Thus, we would instead take a smaller survey of say, 1,000 Americans, and use the results of the survey to draw inferences about the population as a whole.

This is the whole premise behind inferential statistics – we want to answer some question about a population, so we obtain data for a small sample of that population and use the data from the sample to draw inferences about the population.

The Importance of a Representative Sample

In order to be confident in our ability to use a sample to draw inferences about a population, we need to make sure that we have a  representative sample   – that is, a sample in which the characteristics of the individuals in the sample closely match the characteristics of the overall population.

Ideally, we want our sample to be like a “mini version” of our population. So, if we want to draw inferences on a population of students composed of 50% girls and 50% boys, our sample would not be representative if it included 90% boys and only 10% girls.


If our sample is not similar to the overall population, then we cannot generalize the findings from the sample to the overall population with any confidence.

How to Obtain a Representative Sample

To maximize the chances that you obtain a representative sample, you need to focus on two things:

1. Make sure you use a random sampling method.

There are several different random sampling methods that you can use that are likely to produce a representative sample, including:

  • A simple random sample
  • A systematic random sample
  • A cluster random sample
  • A stratified random sample

Random sampling methods tend to produce representative samples because every member of the population has an equal chance of being included in the sample.

2. Make sure your sample size is large enough . 

Along with using an appropriate sampling method, it’s important to ensure that the sample is large enough so that you have enough data to generalize to the larger population.

To determine how large your sample should be, you have to consider the population size you’re studying, the confidence level you’d like to use, and the margin of error you consider to be acceptable.

Fortunately, you can use online calculators to plug in these values and see how large your sample needs to be.
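Those calculators generally implement a standard formula. One common version is Cochran's formula for estimating a proportion, with a finite-population correction; the sketch below shows the idea, but treat it as an approximation rather than a substitute for a proper power analysis:

```python
import math

def sample_size(population, confidence_z=1.96, margin=0.05, p=0.5):
    """Approximate sample size for estimating a proportion.

    confidence_z: z-score for the confidence level (1.96 ~ 95%)
    margin: acceptable margin of error (0.05 = +/-5%)
    p: expected proportion; 0.5 is the most conservative choice
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                   # finite-population correction
    return math.ceil(n)

print(sample_size(10_000))   # ~370 for a 10,000-person population
```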

Common Forms of Inferential Statistics

There are three common forms of inferential statistics:

1. Hypothesis Tests.

Often we’re interested in answering questions about a population such as:

  • Is the percentage of people in Ohio in support of candidate A higher than 50%?
  • Is the mean height of a certain plant equal to 14 inches?
  • Is there a difference between the mean height of students at School A compared to School B?

To answer these questions we can perform a hypothesis test , which allows us to use data from a sample to draw conclusions about populations.

2. Confidence Intervals . 

Sometimes we’re interested in estimating some value for a population. For example, we might be interested in the mean height of a certain plant species in Australia.

Instead of going around and measuring every single plant in the country, we might collect a small sample of plants and measure each one. Then, we can use the mean height of the plants in the sample to estimate the mean height for the population.

However, our sample is unlikely to provide a perfect estimate for the population. Fortunately, we can account for this uncertainty by creating a confidence interval , which provides a range of values that we’re confident the true population parameter falls in.

For example, we might produce a 95% confidence interval of [13.2, 14.8], which says we’re 95% confident that the true mean height of this plant species is between 13.2 inches and 14.8 inches.
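As a rough sketch of how such an interval is computed for a mean, the following uses the normal-approximation value z = 1.96; for a sample this small you would properly use a t-value, and the plant heights are made up:

```python
import math
import statistics

heights = [13.1, 14.2, 13.8, 14.9, 13.5, 14.4, 13.9, 14.1]   # hypothetical sample (inches)

mean = statistics.mean(heights)
se = statistics.stdev(heights) / math.sqrt(len(heights))   # standard error of the mean

low, high = mean - 1.96 * se, mean + 1.96 * se             # ~95% confidence interval
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```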

3. Regression .

Sometimes we’re interested in understanding the relationship between two variables in a population.

For example, suppose we want to know if  hours spent studying per week  is related to  test scores . To answer this question, we could perform a technique known as  regression analysis .

So, we may observe the number of hours studied along with the test scores for 100 students and perform a regression analysis to see if there is a significant relationship between the two variables.

If the p-value of the regression turns out to be significant , then we can conclude that there is a significant relationship between these two variables in the overall population of students.

The Difference Between Descriptive and Inferential Statistics

In summary, the difference between descriptive and inferential statistics can be described as follows:

Descriptive statistics  use summary statistics, graphs, and tables to describe  a data set.

This is useful for helping us gain a quick and easy understanding of a data set without pouring over all of the individual data values.

Inferential statistics  use samples to draw  inferences  about larger populations.

Depending on the question you want to answer about a population, you may decide to use one or more of the following methods: hypothesis tests, confidence intervals, and regression analysis.

If you do choose to use one of these methods, keep in mind that your sample needs to be representative of your population , or the conclusions you draw will be unreliable.


Descriptive Statistics: Definitions, Types, and Examples

Curious about Descriptive Statistics? This branch of statistics involves summarizing and describing the essential features of a data set. In this blog, we will explore key concepts of Descriptive Statistics like mean, median, mode, and standard deviation, and how they provide insights into data trends. Let's dive into Descriptive Statistics!


Ever stared at a massive dataset, feeling overwhelmed by numbers? Making sense of raw data can be daunting, but fear not! Descriptive Statistics is your key to unlocking hidden patterns and insights. Imagine transforming a chaotic collection of numbers into a clear and compelling story.  

This blog will guide you through the world of Descriptive Statistics, breaking down complex concepts into easy-to-understand terms. From understanding central tendencies to measuring dispersion, we'll explore the different types of Descriptive Statistics and provide practical examples to illustrate their applications. Get ready to transform data into actionable information! 

Table of Contents  

1) Understanding Descriptive Statistics 

2) Types of Descriptive Statistics 

3) Primary Purpose of Descriptive Statistics 

4) Difference Between Univariate and Bivariate Statistics 

5) Difference Between Descriptive Statistics and Inferential Statistics 

6) Conclusion 

Understanding Descriptive Statistics 

Descriptive Statistics are tools used to summarise and describe the main features of a dataset. They provide simple summaries about the sample and the measures. Key aspects include measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), and graphical representations (histograms, bar charts).  

These statistics help in understanding the distribution, central value, and variability of data, offering a clear overview without making any conclusions beyond the data. Descriptive Statistics are foundational in data analysis, setting the stage for more complex inferential statistics.  


Types of Descriptive Statistics 

Building on our understanding of Descriptive Statistics, let us delve into the specific types: 

For all of the cases below, we will consider the following sample dataset of ten students' test scores:

85, 92, 78, 88, 73, 95, 80, 85, 90, 87

1) Central Tendency Measures 

Central tendency measures describe the centre of a dataset, summarising the data with a single value that represents the "typical" data point. The main measures are the mean (average), median (middle value), and mode (most frequent value). These measures provide insights into the dataset's general tendency, helping identify where most data points cluster. For example, the mean gives an overall average, the median shows the midpoint, and the mode highlights the most common occurrence, each offering a different perspective on the dataset's central point. 

Let us consider the above dataset. 

Mean: The average score is calculated by summing all the scores and dividing by the number of students:

Mean = (85 + 92 + 78 + 88 + 73 + 95 + 80 + 85 + 90 + 87) / 10 = 853 / 10 = 85.3

Median: The middle value when the scores are arranged in ascending order. For an even number of scores, the median is the average of the two middle numbers.

Ordered Scores: 73, 78, 80, 85, 85, 87, 88, 90, 92, 95

Median = (85 + 87) / 2 = 86

Mode: The most frequent score in the dataset, which is 85.
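All three measures can be confirmed in a few lines of Python:

```python
import statistics

scores = [85, 92, 78, 88, 73, 95, 80, 85, 90, 87]

print(statistics.mean(scores))     # 85.3
print(statistics.median(scores))   # 86.0 (average of the middle scores 85 and 87)
print(statistics.mode(scores))     # 85
```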

2) Distribution Analysis 

Distribution analysis examines the spread and shape of data within a dataset, providing insights into patterns, trends, and potential anomalies. It involves evaluating the data's frequency distribution, using graphical tools like histograms and probability density functions.  

This analysis helps in understanding whether the data is skewed, symmetric, or has outliers. It also includes looking at the data's range, quartiles, and any peaks or troughs, which can indicate important characteristics such as modality and kurtosis, essential for identifying the nature and tendencies within the data.


In this dataset, scores like 85, 87, and 90 occur around the middle range, with a few scores at the extremes (73 and 95). The distribution can be assessed for skewness (if the scores are more spread out on one side) or for bimodality (if there are two peaks). 


3) Variability Measures 

Variability measures, also known as dispersion measures, quantify the extent of spread in a dataset. Key metrics include range (difference between the highest and lowest values), variance (average of squared differences from the mean), and standard deviation (square root of variance). These measures help in understanding the distribution's spread, showing how much the data points differ from the central tendency. High variability indicates that data points are spread out over a wider range of values, while low variability suggests they are closer to the mean. 

Let us once again consider the previously established dataset.

Range: The difference between the highest and lowest scores.

Range = 95 − 73 = 22

Variance: Measures the average squared deviation from the mean. Treating the ten scores as a complete population:

σ² = [(85 − 85.3)² + (92 − 85.3)² + (78 − 85.3)² + (88 − 85.3)² + (73 − 85.3)² + (95 − 85.3)² + (80 − 85.3)² + (85 − 85.3)² + (90 − 85.3)² + (87 − 85.3)²] / 10 = 404.10 / 10 = 40.41

Standard Deviation (σ): The square root of the variance, providing a measure of spread in the same units as the data.

σ = √40.41 ≈ 6.36

4) Univariate Descriptive Statistics 

Univariate Descriptive Statistics focus on summarising and analysing a single variable within a dataset. This analysis includes calculating central tendency and variability measures, as well as using graphical representations like box plots and histograms to visualise data distribution. Univariate analysis provides a comprehensive overview of the variable's characteristics, such as its typical value, spread, and any potential anomalies or outliers. This foundational analysis is crucial for understanding the basic properties of the data before moving on to more complex multivariate analyses.


The Primary Purpose of Descriptive Statistics 

The primary purpose of Descriptive Statistics is to provide a concise summary and understanding of a dataset's main characteristics. By using measures of central tendency (such as mean, median, and mode), measures of variability (including range, variance, and standard deviation), and distribution analysis, Descriptive Statistics help to simplify large amounts of data into comprehensible formats.  

This enables researchers and analysts to identify patterns, trends, and anomalies within the data. Descriptive Statistics also serve as a foundation for further statistical analysis, allowing for the comparison of datasets and the identification of relationships between variables. By presenting data in a clear and organised manner, these statistics make it easier to communicate findings and support decision-making processes across various fields, such as business, healthcare, and social sciences. 


Difference Between Univariate and Bivariate Statistics 

Univariate statistics involve the analysis of a single variable, focusing on its distribution, central tendency, and variability. This type of analysis provides insights into the characteristics of that specific variable, such as its mean, median, mode, range, variance, and standard deviation. It's useful for summarising and understanding the data's basic properties without considering relationships with other variables. 

In contrast, bivariate statistics examine the relationship between two variables. This analysis explores how changes in one variable correlate with changes in another, often using measures like correlation coefficients, regression analysis, and scatter plots. Bivariate analysis helps identify and quantify associations, trends, or patterns between variables, providing deeper insights into potential causal relationships or dependencies. While univariate analysis focuses solely on individual variable properties, bivariate analysis seeks to understand interactions between pairs of variables. 


Difference Between Descriptive Statistics and Inferential Statistics 

Descriptive Statistics and inferential statistics serve different purposes in data analysis. Descriptive Statistics focus on summarising and describing the main features of a dataset. They provide simple summaries and visualisations, such as mean, median, mode, range, and standard deviation, to help understand the basic characteristics of the data. These statistics do not involve generalisations beyond the data at hand. 

Inferential statistics, on the other hand, involve making predictions or inferences about a population based on a sample. This branch of statistics uses methods such as hypothesis testing, confidence intervals, and regression analysis to draw conclusions and make decisions. Inferential statistics allow researchers to estimate population parameters and test theories, even when data from the entire population is not available. While Descriptive Statistics offer a snapshot of the data, inferential statistics extend that snapshot to make broader conclusions and predictions. 



Conclusion 

Descriptive Statistics are your first step in understanding data. By mastering measures of central tendency, dispersion, and distribution, you can extract meaningful insights. Remember, effective data analysis begins with a solid grasp of Descriptive Statistics. Now, go forth and explore your data! 


Frequently Asked Questions

What do Descriptive Statistics require?

Descriptive Statistics require data collection, organisation, and summary. They need accurate measurements of central tendency, variability, and distribution to effectively describe and summarise the dataset's main characteristics.

What do Descriptive Statistics not tell you?

Descriptive Statistics do not provide information about cause-and-effect relationships, nor do they offer predictions or inferences about a population based on a sample. They solely summarise and describe the data without drawing broader conclusions.

What is the main concern of Descriptive Statistics?

The main concern of Descriptive Statistics is to summarise and describe the essential features of a dataset in a clear and understandable manner, highlighting patterns, trends, and anomalies without making generalisations beyond the observed data.




Descriptive statistics in research: a critical component of data analysis.

With any data, the object is to describe the population at large, but what does that mean and what processes, methods and measures are used to uncover insights from that data? In this short guide, we explore descriptive statistics and how it’s applied to research.

What do we mean by descriptive statistics?

With any kind of data, the main objective is to describe a population at large — and using descriptive statistics, researchers can quantify and describe the basic characteristics of a given data set.

For example, researchers can condense large data sets, which may contain thousands of individual data points or observations, into a series of statistics that provide useful information on the population of interest. We call this process “describing data”.

In the process of producing summaries of the sample, we use measures like mean, median, variance, graphs, charts, frequencies, histograms, box and whisker plots, and percentages. For datasets with just one variable, we use univariate descriptive statistics. For datasets with multiple variables, we use bivariate correlation and multivariate descriptive statistics.

Want to find out the definitions?

Univariate descriptive statistics: this is when you want to describe data with only one characteristic or attribute

Bivariate correlation: this is when you simultaneously analyse (compare) two variables to see if there is a relationship between them

Multivariate descriptive statistics: this is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable

Then, after describing and summarising the data, as well as using simple graphical analyses, we can start to draw meaningful insights from it to help guide specific strategies. It’s also important to note that descriptive statistics can employ and use both quantitative and  qualitative research .

Describing data is undoubtedly the most critical first step in research as it enables the subsequent organisation, simplification and summarisation of information — and every survey question and population has summary statistics. Let’s take a look at a few examples.

Examples of descriptive statistics

Consider for a moment a number used to summarise how well a striker is performing in football: the goal conversion rate. This number is simply the number of goals scored divided by the number of shots taken (reported to three significant digits). If a striker is scoring 0.333, that's one goal for every three shots. If they're scoring one in four, that's 0.250.

A classic example is a student’s grade point average (GPA). This single number describes the general performance of a student across a range of course experiences and classes. It doesn’t tell us anything about the difficulty of the courses the student is taking, or what those courses are, but it does provide a summary that enables a degree of comparison with people or other units of data.

Ultimately, descriptive statistics make it incredibly easy for people to understand complex (or data intensive) quantitative or qualitative insights across large data sets.


Types of descriptive statistics

To quantitatively summarise the characteristics of raw, ungrouped data, we use the following types of descriptive statistics:

  • Measures of Central Tendency ,
  • Measures of Dispersion  and
  • Measures of Frequency Distribution.

Following the application of any of these approaches, the raw data then becomes ‘grouped’ data that’s logically organised and easy to understand. To visually represent the data, we then use graphs, charts, tables etc.

Let’s look at the different types of measurement and the statistical methods that belong to each:

Measures of Central Tendency  are used to describe data by determining a single representative of central value. For example, the mean, median or mode.

Measures of Dispersion  are used to determine how spread out a data distribution is with respect to the central value, e.g. the mean, median or mode. For example, while central tendency gives the person the average or central value, it doesn’t describe how the data is distributed within the set.

Measures of Frequency Distribution  are used to describe the occurrence of data within the data set (count).

The methods of each measure are summarised in the table below:

Measures of Central Tendency   Measures of Dispersion   Measures of Frequency Distribution
Mean                           Range                    Count
Median                         Standard deviation
Mode                           Quartile deviation
                               Variance
                               Absolute deviation

Mean:  The most popular and well-known measure of central tendency. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

Median:  The median is the middle score for a set of data that has been arranged in order of magnitude. If you have an even number of data, e.g. 10 data points, take the two middle scores and average the result.

Mode:  The mode is the most frequently occurring observation in the data set.  

Range:  The difference between the highest and lowest value.

Standard deviation:  Standard deviation measures the dispersion of a data set relative to its mean and is calculated as the square root of the variance.

Quartile deviation : Quartile deviation measures the deviation in the middle of the data.

Variance:  Variance measures the variability from the average or mean.

Absolute deviation:  The absolute deviation of a dataset is the average distance between each data point and the mean.

Count:  How often each value occurs.
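As a quick illustration, most of the measures in this table can be computed with the Python standard library; quartile deviation and absolute deviation are derived below directly from their definitions, and the data values are made up:

```python
import statistics
from collections import Counter

data = [4, 8, 6, 5, 3, 7, 8, 9, 4, 8]             # hypothetical observations

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
rng = max(data) - min(data)
sd = statistics.stdev(data)
var = statistics.variance(data)

q1, _, q3 = statistics.quantiles(data, n=4)       # quartile cut points
quartile_dev = (q3 - q1) / 2                      # half the interquartile range
abs_dev = sum(abs(x - mean) for x in data) / len(data)
counts = Counter(data)                            # frequency of each value

print(mean, median, mode, rng, sd, var, quartile_dev, abs_dev, counts)
```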

Scope of descriptive statistics in research

Descriptive statistics (or analysis) is considered broader in scope than other quantitative and qualitative methods, as it provides a much fuller picture of an event, phenomenon or population.

But that’s not all: it can use any number of variables, and as it collects data and describes it as it is, it’s also far more representative of the world as it exists.

However, it’s also important to consider that descriptive analyses lay the foundation for further methods of study. By summarising and condensing the data into easily understandable segments, researchers can further analyse the data to uncover new variables or hypotheses.

Mostly, this practice is all about the ease of data visualisation. With data presented in a meaningful way, researchers have a simplified interpretation of the data set in question. That said, while descriptive statistics helps to summarise information, it only provides a general view of the variables in question.

It is, therefore, up to the researchers to probe further and use other methods of analysis to discover deeper insights.

Things you can do with descriptive statistics:

  • Define subject characteristics:  If a marketing team wanted to build out accurate buyer personas for specific products and industry verticals, they could use descriptive analyses on customer datasets (procured via a survey) to identify consistent traits and behaviours.

They could then ‘describe’ the data to build a clear picture and understanding of who their buyers are, including things like preferences, business challenges, income and so on.

  • Measure data trends

Let’s say you wanted to assess propensity to buy over several months or years for a specific target market and product. With descriptive statistics, you could quickly summarise the data and extract the precise data points you need to understand the trends in product purchase behaviour.

  • Compare events, populations or phenomena

How do different demographics respond to certain variables? For example, you might want to run a customer study to see how buyers in different job functions respond to new product features or price changes. Are all groups as enthusiastic about the new features and likely to buy? Or do they have reservations? This kind of data will help inform your overall product strategy and potentially how you tier solutions.

  • Validate existing conditions

When you have a belief or hypothesis but need to prove it, you can use descriptive techniques to ascertain underlying patterns or assumptions.

  • Form new hypotheses

With the data presented and surmised in a way that everyone can understand (and infer connections from), you can delve deeper into specific data points to uncover deeper and more meaningful insights — or run more comprehensive research.

Guiding your survey design to improve the data collected

To use your surveys as an effective tool for customer engagement and understanding, every survey goal and item should answer one simple, yet highly important question:

“What am I really asking?”

It might seem trivial, but by having this question frame survey research, it becomes significantly easier for researchers to develop the  right questions  that uncover useful, meaningful and actionable insights.

Planning becomes easier, questions clearer and perspective far wider and yet nuanced.

Hypothesise — what’s the problem that you’re trying to solve? Far too often, organisations collect data without understanding what they’re asking, and why they’re asking it.

Finally, focus on the end result. What kind of data do you need to answer your question? Also, are you asking a quantitative or qualitative question? Here are a few things to consider:

  • Clear questions are clear for everyone. It takes time to make a concept clear
  • Ask about measurable, evident and noticeable activities or behaviours.
  • Make rating scales easy. Avoid long lists, confusing scales or “don’t know” or “not applicable” options.
  • Ensure your survey makes sense and flows well. Reduce the cognitive load on respondents by making it easy for them to complete the survey.
  • Read your questions aloud to see how they sound.
  • Pretest by asking a few uninvolved individuals to answer.

Furthermore…

As well as understanding what you’re really asking, there are several other considerations for your data:

  • Keep it random

How you select your sample is what makes your research replicable and meaningful. Having a truly random sample helps prevent bias, increasing the quality of evidence you find.

  • Plan for and avoid sample error

Before starting your research project, have a clear plan for avoiding sample error. Use larger sample sizes, and apply random sampling to minimise the potential for bias.

  • Don’t over sample

Remember, you can sample 500 respondents selected randomly from a population and they will closely reflect the actual population 95% of the time.

  • Think about the mode

Match your survey methods to the sample you select. For example, how do your current customers prefer communicating? Do they have any shared characteristics or preferences? A mixed-method approach is critical if you want to drive action across different customer segments.

Use a survey tool that supports you with the whole process

Surveys created using survey research software can support researchers in a number of ways, for example with ready-made templates:

  • Employee satisfaction  survey template
  • Employee exit  survey template
  • Customer satisfaction (CSAT)  survey template
  • Ad testing  survey template
  • Brand awareness  survey template
  • Product pricing  survey template
  • Product research  survey template
  • Employee engagement  survey template
  • Customer service  survey template
  • NPS  survey template
  • Product package testing  survey template
  • Product features prioritisation  survey template

These considerations have been included in  Qualtrics’ survey software , which summarises and creates visualisations of data, making it easy to access insights, measure trends, and examine results without complexity or jumping between systems.

Uncover your next breakthrough idea with Stats iQ™

What makes Qualtrics so different from other survey providers is that it is built in consultation with trained research professionals and includes  high-tech statistical software like Qualtrics Stats iQ .

With just a click, the software can run specific analyses or automate statistical testing and data visualisation. Testing parameters are automatically chosen based on how your data is structured (e.g. categorical data will run a statistical test like Chi-squared), and the results are translated into plain language that anyone can understand and put into action.

  • Get more meaningful insights from your data

Stats iQ includes a variety of statistical analyses, including: describe, relate, regression, cluster, factor, TURF, and pivot tables — all in one place!

  • Confidently analyse complex data

Built-in artificial intelligence and advanced algorithms automatically choose and apply the right statistical analyses and return the insights in plain English so everyone can take action.

  • Integrate existing statistical workflows

For more experienced stats users, built-in R code templates allow you to run even more sophisticated analyses by adding R code snippets directly in your survey analysis.

Advanced statistical analysis methods available in Stats iQ

Regression analysis – Measures the degree of influence of independent variables on a dependent variable (the relationship between two or multiple variables).

Analysis of Variance (ANOVA) test  – Commonly used with a regression study to find out what effect independent variables have on the dependent variable. It can compare multiple groups simultaneously to see if there is a relationship between them.

Conjoint analysis  – Asks people to make trade-offs when making decisions, then analyses the results to give the most popular outcome. Helps you understand why people make the complex choices they do.

T-Test  – Helps you compare whether two data groups have different mean values and allows the user to interpret whether differences are meaningful or merely coincidental.

Crosstab analysis  – Used in quantitative  market research to analyse categorical data – that is, variables that are different and mutually exclusive, and allows you to compare the relationship between two variables in contingency tables.

Go from insights to action

Now that you have a better understanding of descriptive statistics in research and how you can leverage statistical analysis methods correctly, now’s the time to utilise a tool that can take your research and subsequent analysis to the next level.

Try out a Qualtrics survey software demo so you can see how it can take you through  descriptive research  and further research projects from start to finish.



Descriptive Statistics

Descriptive statistics is a subfield of statistics that deals with characterizing the features of known data. Descriptive statistics give summaries of either population or sample data. Aside from descriptive statistics, inferential statistics is another important discipline of statistics used to draw conclusions about population data.

Descriptive statistics is divided into two categories:

  • Measures of Central Tendency
  • Measures of Dispersion

In this article, we will learn about descriptive statistics, including their many categories, formulae, and examples in detail.

What is Descriptive Statistics?

Descriptive statistics is a branch of statistics focused on summarizing, organizing, and presenting data in a clear and understandable way. Its primary aim is to define and analyze the fundamental characteristics of a dataset without making sweeping generalizations or assumptions about the entire data set.

The main purpose of descriptive statistics is to provide a straightforward and concise overview of the data, enabling researchers or analysts to gain insights and understand patterns, trends, and distributions within the dataset.

Descriptive statistics typically involve measures of central tendency (such as mean, median, mode), dispersion (such as range, variance, standard deviation), and distribution shape (including skewness and kurtosis). Additionally, graphical representations like charts, graphs, and tables are commonly used to visualize and interpret the data.

Histograms, bar charts, pie charts, scatter plots, and box plots are some examples of widely used graphical techniques in descriptive statistics.

Descriptive Statistics Definition

Descriptive statistics is a type of statistical analysis that uses quantitative methods to summarize the features of a sample or population. It is useful for presenting simple, exact summaries of the sample and observations using metrics such as the mean, median, variance, graphs, and charts.

Types of Descriptive Statistics

There are three types of descriptive statistics:

  • Measures of Frequency Distribution
  • Measures of Central Tendency
  • Measures of Dispersion

In the following sections, we look at each of these in turn.

Measures of Central Tendency

A measure of central tendency is a statistical measure that describes an entire distribution or dataset with a single representative value. Below, we look at the main measures of central tendency, their formulas, and examples.

Mean

The mean is the sum of all the values in a group or collection divided by the number of values in that group or collection. The mean of a dataset is typically denoted x̄ (pronounced "x bar"). For ungrouped data, the formula is given as follows.

For a series of observations:

x̄ = Σx / n
  • x̄ = Mean Value of Provided Dataset
  • Σx = Sum of All Terms
  • n = Number of Terms

Example: The weights of 7 girls in kg are 54, 32, 45, 61, 20, 66 and 50. Determine the mean weight for this collection of data.

Mean = Σx/n = (54 + 32 + 45 + 61 + 20 + 66 + 50)/7 = 328/7 ≈ 46.86

Thus, the group's mean weight is approximately 46.86 kg.

Median

The median is the value of the middle-most observation obtained after arranging the data in ascending order. The median formula differs for grouped and ungrouped data; for ungrouped data:

Median of ungrouped data (n odd) = [(n + 1)/2]th term

Median of ungrouped data (n even) = [(n/2)th term + ((n/2) + 1)th term]/2

Example: The weights of 7 girls in kg are 54, 32, 45, 61, 20, 66 and 50. Determine the median weight for this collection of data.

Arrange the data in ascending order: 20, 32, 45, 50, 54, 61, 66

Median = [(n + 1)/2]th term = [(7 + 1)/2]th term = 4th term = 50

Thus, the group's median weight is 50 kg.

Mode

The mode is the value that appears most frequently in the provided data, i.e. the observation with the highest frequency. For ungrouped data:

Mode of Ungrouped Data: Most Repeated Observation in Dataset

Example: The weights of 7 girls in kg are 54, 32, 45, 61, 20, 45 and 50. Determine the modal weight for this collection of data.

Mode = most repeated observation in the dataset = 45

Thus, the group's modal weight is 45 kg.
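The three worked examples above can be verified with Python's built-in statistics module; this is just a sketch to check the arithmetic:

```python
import statistics

weights = [54, 32, 45, 61, 20, 66, 50]         # mean / median example
weights_repeat = [54, 32, 45, 61, 20, 45, 50]  # mode example

print(statistics.mean(weights))         # 46.857... (approx. 46.86 kg)
print(statistics.median(weights))       # 50 (the 4th value once sorted)
print(statistics.mode(weights_repeat))  # 45 (appears twice)
```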

Measures of Dispersion

When the variability of data within an experiment must be established, absolute measures of variability are employed. These metrics express differences within a data collection in terms of the average deviations of the observations. The most common absolute measures of dispersion, along with their formulas, are covered below.

Range

The range represents the spread of your data from the lowest to the highest value in the distribution. It is the most straightforward measure of variability to compute: subtract the lowest value in the data set from the highest.

Range = Highest Value – Lowest Value

Example: Calculate the range of the following data series:  5, 13, 32, 42, 15, 84

Arrange the data series in ascending order: 5, 13, 15, 32, 42, 84

Range = H − L = 84 − 5 = 79

So, the range is 79.

Standard Deviation

Standard deviation (s or SD) represents the average level of variability in your dataset: roughly, the average deviation of each score from the mean. The higher the standard deviation, the more spread out the dataset is.

To calculate standard deviation, follow these six steps:

Step 1: Make a list of each score and calculate the mean.

Step 2: Calculate deviation from the mean, by subtracting the mean from each score.

Step 3: Square each of these differences.

Step 4: Sum up all the squared deviations.

Step 5: Divide the sum of squared deviations by n − 1 (for a sample).

Step 6: Find the square root of the number that you discovered.

Example: Calculate the standard deviation of the following data series: 5, 13, 32, 42, 15, 84.

Step 1: Calculate the mean of the series using Σx/n.

Step 2: Calculate each deviation from the mean by subtracting the mean from each value.

Step 3: Square each deviation and sum the squares (see the table below).

Series | Deviation from Mean | Squared Deviation
5 | 5 − 31.83 = −26.83 | 719.85
13 | 13 − 31.83 = −18.83 | 354.57
32 | 32 − 31.83 = 0.17 | 0.0289
42 | 42 − 31.83 = 10.17 | 103.43
15 | 15 − 31.83 = −16.83 | 283.25
84 | 84 − 31.83 = 52.17 | 2721.71
Mean = 191/6 = 31.83 | Sum of deviations = 0 | Sum = 4182.84

Step 5: Divide the sum of squared deviations by n − 1: 4182.84 / 5 = 836.57

Step 6: Take the square root: √836.57 ≈ 28.92

So, the standard deviation is approximately 28.92.

Variance is the average of the squared deviations from the mean and measures the degree of dispersion in a data collection: the more scattered the data, the larger the variance relative to the mean. The variance is the square of the standard deviation.

The symbol for sample variance is s².

Example: Calculate the variance of the following data series: 5, 13, 32, 42, 15, 84.

The variance is the quantity computed in Step 5 above, before taking the square root: s² = 4182.84 / 5 ≈ 836.57. (Squaring the rounded standard deviation, 28.92² ≈ 836.37, gives a slightly different figure purely because of rounding.) So, the variance is approximately 836.57.
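These dispersion calculations can be checked in a few lines; note that statistics.stdev and statistics.variance use the same sample (n − 1) denominator as the steps above:

```python
import statistics

data = [5, 13, 32, 42, 15, 84]

print(max(data) - min(data))                # range: 79
print(round(statistics.variance(data), 2))  # sample variance: 836.57
print(round(statistics.stdev(data), 2))     # standard deviation: 28.92
```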

Mean Deviation

Mean deviation is the average of the absolute deviations of the data about the mean, median, or mode; it is sometimes also known as absolute deviation. The formula for mean deviation is:

Mean Deviation = (Σ |xᵢ − μ|) / n, summed over i = 1 to n

  • μ is the central value (mean, median, or mode)
  • n is the number of observations

Quartile Deviation

Quartile deviation is half the difference between the third and first quartiles. The formula for quartile deviation is:

Quartile Deviation = (Q3 − Q1)/2

  • Q3 is the third quartile
  • Q1 is the first quartile

A short code sketch of both deviation measures follows.
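Below is a minimal Python sketch of both measures, reusing the series from the earlier examples. One caveat: numpy's default (linear) percentile interpolation can give slightly different quartiles than some textbook methods.

```python
import numpy as np

data = np.array([5, 13, 32, 42, 15, 84])

# Mean deviation about the mean: average absolute distance from the mean.
mean_dev = np.mean(np.abs(data - data.mean()))
print(round(mean_dev, 2))  # 20.83

# Quartile deviation: half the interquartile range, (Q3 - Q1) / 2.
q1, q3 = np.percentile(data, [25, 75])
print((q3 - q1) / 2)  # 13.0 for this series (with linear interpolation)
```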

Other measures of dispersion include the relative measures also known as the coefficients of dispersion.

Measures of Frequency Distribution

Datasets consist of various scores or values. Statisticians employ graphs and tables to summarize the occurrence of each possible value of a variable, often presented in percentages or counts.

For instance, suppose you were conducting a poll to determine people's favorite Beatle. You would create one column listing all potential options (John, Paul, George, and Ringo) and another column indicating the number of votes each received. Statisticians represent these frequency distributions through graphs or tables.
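In code, such a frequency table is essentially one line with collections.Counter; the votes below are invented for illustration:

```python
from collections import Counter

# Invented poll responses: each entry is one person's favourite Beatle.
votes = ["Paul", "John", "Paul", "Ringo", "George", "Paul", "John"]

for beatle, count in Counter(votes).most_common():
    print(f"{beatle}: {count}")
```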

Univariate Descriptive Statistics

Univariate descriptive statistics focus on one variable at a time. We look at each variable individually and use the measures described above to understand it better. Programs like SPSS and Excel can help with this.

If we only look at the average (mean) of something, like how much people earn, it might not give us the true picture, especially if some people earn a lot more or less than others. Instead, we can also look at the middle value (median) or the most frequent value (mode). And to understand how spread out the values are, we use the range along with the standard deviation and variance.

Bivariate Descriptive Statistics

When we have information about more than one variable, we can use bivariate or multivariate descriptive statistics to see whether they are related. Bivariate analysis compares two variables to see if they change together. Before doing any more complicated tests, it is important to compare the central tendency of the two variables.

Multivariate analysis is similar to bivariate analysis, but it looks at more than two things at once, which helps us understand relationships even better.

Representations of Data in Descriptive Statistics

Descriptive statistics use a variety of ways to summarize and present data in an understandable manner. This helps us grasp the data set’s patterns, trends, and properties.

Frequency Distribution Tables: Frequency distribution tables divide data into categories or intervals and display the number of observations (frequency) that fall into each one. For example, suppose we have a class of 20 students and are tracking their test scores. We may make a frequency distribution table that contains score ranges (e.g., 0-10, 11-20) and displays how many students scored in each range.

Graphs and Charts: Graphs and charts graphically display data, making it simpler to understand and analyze. For example, using the same test score data, we may generate a bar graph with the x-axis representing score ranges and the y-axis representing the number of students. Each bar on the graph represents a score range, and its height shows the number of students scoring within that range.
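A sketch of the binning step behind both the table and the bar graph, assuming invented scores for a class of 20 students and the 0-10 / 11-20 / ... ranges mentioned above:

```python
import pandas as pd

# Invented test scores for a class of 20 students.
scores = [4, 12, 15, 18, 22, 25, 27, 31, 33, 35,
          38, 41, 44, 47, 52, 55, 58, 61, 67, 73]

# Bin edges and labels for the score ranges in the example.
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ["0-10", "11-20", "21-30", "31-40",
          "41-50", "51-60", "61-70", "71-80"]

binned = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)
print(pd.Series(binned).value_counts().sort_index())
```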

These approaches help us summarize and visualize data, making it easier to discover trends, patterns, and outliers, which is critical for making informed decisions and reaching meaningful conclusions in a variety of sectors.

Descriptive Statistics Applications

Descriptive statistics are used in a variety of sectors to summarize, organize, and display data in a meaningful and intelligible way. Here are a few popular applications:

  • Business and Economics: Descriptive statistics are useful for analyzing sales data, market trends, and customer behaviour. They are used to generate averages, medians, and standard deviations in order to better evaluate product performance, pricing strategies, and financial metrics.
  • Healthcare: Descriptive statistics are used to analyze patient data such as demographics, medical histories, and treatment outcomes. They assist healthcare workers in determining illness prevalence, assessing treatment efficacy, and identifying risk factors.
  • Education: Descriptive statistics are useful in education since they summarize student performance on tests and examinations. They assist instructors in assessing instructional techniques, identifying areas for improvement, and monitoring student growth over time.
  • Market Research: Descriptive statistics are used to analyze customer preferences, product demand, and market trends. They enable businesses to make educated decisions about product development, advertising campaigns, and market segmentation.
  • Finance and investment: Descriptive statistics are used to analyze stock market data, portfolio performance, and risk management. They assist investors in determining investment possibilities, tracking asset values, and evaluating financial instruments.

Difference Between Descriptive Statistics and Inferential Statistics

The differences between descriptive and inferential statistics are summarized in the table below.

Descriptive Statistics vs Inferential Statistics

Descriptive Statistics | Inferential Statistics
Does not involve making predictions or generalizations beyond the dataset. | Involves making forecasts or generalizations about a wider population.
Gives a basic summary of the sample. | Draws conclusions about the population based on the sample.
Examples include mean, median, mode, standard deviation, etc. | Examples include hypothesis testing, confidence intervals, regression analysis, etc.
Focuses on the properties of the current dataset. | Concentrates on drawing conclusions about the population from sample data.
Helpful for comprehending data patterns and linkages. | Useful for making judgements, predictions, and drawing inferences that go beyond the observed data.

Solved Examples on Descriptive Statistics

Example 1: Calculate the Mean, Median and Mode for the following series: {4, 8, 9, 10, 6, 12, 14, 4, 5, 3, 4}

First, calculate the mean:

Mean = Σx/n = (4 + 8 + 9 + 10 + 6 + 12 + 14 + 4 + 5 + 3 + 4)/11 = 79/11 ≈ 7.18

Thus, the mean is approximately 7.18.

Next, the median. Arrange the data in ascending order: 3, 4, 4, 4, 5, 6, 8, 9, 10, 12, 14

Median = [(n + 1)/2]th term = [(11 + 1)/2]th term = 6th term = 6

Thus, the median is 6.

Finally, the mode: the most repeated observation in the dataset is 4, so the mode is 4.

Example 2: Calculate the Range for the following series: {4, 8, 9, 10, 6, 12, 14, 4, 5, 3, 4}

Arrange the data series in ascending order: 3, 4, 4, 4, 5, 6, 8, 9, 10, 12, 14

Range = H − L = 14 − 3 = 11

So, the range is 11.

Example 3: Calculate the standard deviation and variance of following data: {12, 24, 36, 48, 10, 18}

First we compute the standard deviation: calculate the mean, each value's deviation from the mean, and the squared deviations (see the table below).

Series | Deviation from Mean | Squared Deviation
12 | 12 − 24.66 = −12.66 | 160.28
24 | 24 − 24.66 = −0.66 | 0.436
36 | 36 − 24.66 = 11.34 | 128.595
48 | 48 − 24.66 = 23.34 | 544.76
10 | 10 − 24.66 = −14.66 | 214.92
18 | 18 − 24.66 = −6.66 | 44.36
Mean = 148/6 = 24.66 | Sum of deviations = 0 | Sum = 1093.35

Divide the sum of squared deviations by n − 1: 1093.35 / 5 = 218.67

Take the square root: √218.67 ≈ 14.79

So, the standard deviation is approximately 14.79.

Now we calculate the variance, which is the sum of squared deviations divided by n − 1 (the value obtained before taking the square root):

s² = 218.67

So, the variance is approximately 218.67.

Practice Problems on Descriptive Statistics

P1) Determine the sample variance of the following series: {17, 21, 52, 28, 26, 23}

P2) Determine the mean and mode of the following series: {21, 14, 56, 41, 18, 15, 18, 21, 15, 18}

P3) Find the median of the following series: {7, 24, 12, 8, 6, 23, 11}

P4) Find the standard deviation and variance of the following series: {17, 28, 42, 48, 36, 42, 20}

FAQs on Descriptive Statistics

What is meant by descriptive statistics?

Descriptive statistics seek to summarize, organize, and display data in an accessible manner while avoiding sweeping generalizations about the whole population. They aid in discovering patterns, trends, and distributions within a dataset.

How is the mean computed in descriptive statistics?

The mean is computed by adding together all of the values in the dataset and dividing by the total number of observations. It measures the dataset's central tendency, or average value.

What role do measures of variability play in descriptive statistics?

Measures of variability, such as range, standard deviation, and variance, aid in quantifying the spread or dispersion of data points around the mean. They give insights on the dataset’s variety and consistency.

Can you explain the median in descriptive statistics?

The median is the middle value of a dataset when it is sorted in ascending or descending order. It measures central tendency and is especially useful when dealing with skewed data or outliers.

How can frequency distribution measurements contribute to descriptive statistics?

Measures of frequency distribution summarize the incidence of various values or categories within a dataset. They give insights into the distribution pattern of the data and are commonly represented by graphs or tables.

How are inferential statistics distinguished from descriptive statistics?

Inferential statistics use sample data to draw inferences or make predictions about a wider population, whereas descriptive statistics summarize aspects of known data. Descriptive statistics concentrate on the present dataset, whereas inferential statistics go beyond the observable data.

Why are descriptive statistics necessary in data analysis?

Descriptive statistics give researchers and analysts a clear and straightforward summary of the dataset, helping them to identify patterns, trends, and distributions. They aid in making educated judgements and gaining valuable insights from data.

What are the four types of descriptive statistics?

There are four major types of descriptive statistics:

  • Measures of Frequency
  • Measures of Central Tendency
  • Measures of Dispersion or Variation
  • Measures of Position

Which is an example of descriptive statistics?

Descriptive statistics examples include the study of mean, median, and mode.



DESCRIPTIVE STATISTICS IN BUSINESS RESEARCH


  • Research Scholar, Department of Management, Mizoram University, Mizoram, India.
  • Professor & Head, Department of Management, Mizoram University, Mizoram, India

The analysis of data is the most skilled task in the business research process, requiring the researcher's own judgment and skill. Different statistical techniques are available to support the researcher's decisions, and the choice of appropriate techniques is determined to a great extent by the research design, the hypotheses and the kind of data collected. These techniques are categorized into descriptive and inferential statistics. This paper focuses only on descriptive statistics, which summarize a large mass of data into an understandable and meaningful form.

  • Descriptive statistics
  • Measures of Central Tendency
  • Measures of Dispersion
  • Measures of Asymmetry
  • Measures of Relationship

[Carolyn Vanlalhriati and E. Nixon Singh (2015); Descriptive Statistics in Business Research. Int. J. of Adv. Res. 3 (Jun), 1409–1415] (ISSN 2320-5407). www.journalijar.com


East African Journal of Management and Business Studies, Vol. 4 No. 1 (2024), Articles


Accounting Information Systems and Their Role in Business Decision-Making Processes in Dodoma City, Tanzania

Adam Aloyce Semlambo, Stephen Lugaimukamu and Mussa Ibrahim

This study reports perceptions of accounting information systems (AIS) and their role in business decision-making processes in Dodoma City, Tanzania, using a descriptive research design. The study focused on a population of 215 entrepreneurs among business organizations, from whom a sample of 54 was drawn. Data were collected in the field through a questionnaire and an interview schedule. Data analysis involved descriptive statistics, in terms of frequencies and percentages, as well as a thematic approach for the qualitative interview data. Based on the findings, the study concluded that AIS significantly enhanced decision-making by providing complete, reliable and timely financial information. Despite the high level of trust in the AIS, minor issues related to human error and occasional delays in data timeliness existed. To enhance the effectiveness of accounting information systems in the city, businesses need to implement rigorous data-entry training to minimize human errors and improve the accuracy of financial information.



Football Analytics: Assessing the Correlation between Workload, Injury and Performance of Football Players in the English Premier League

The aim of this research is to shed light on the complex interactions between player workload, traits, match-related factors, football performance, and injuries in the English Premier League. Using a range of statistical and machine learning techniques, this study analyzed a comprehensive dataset that included variables such as player workload, personal traits, and match statistics.


1. Introduction

1.1. Motivation

1.2. Background and Context

1.3. Aims and Objectives

  • What is the correlation between player workload, player characteristics, and performance of football players in the English Premier League?
  • How do game/match-related variables impact football players’ performance in the English Premier League, and what are the essential variables contributing towards the prediction of player performance?
  • How do game/match-related variables impact the occurrence of injuries among football players in the English Premier League, and what are the important variables contributing to injury prediction?
  • How do various factors influence injury occurrence among players?
  • Correlation Exploration : Conduct detailed statistical analyses, such as descriptive analytics and correlation analysis, to identify and quantify the complex correlations between player workload, individual traits, and performance measures.
  • Predictive Modelling : Develop and integrate advanced ML models to assess the prediction capacity of diverse variables. Determine the essential characteristics that have a major impact on the forecast of player performance (goals and assists), providing insights into the drivers of football success.
  • Injury Occurrence Analysis : Examine the impact of game and match-related variables on the occurrence of injuries among English Premier League football players. Use rigorous statistical tools to determine the key variables contributing to injury incidence.

2. Literature Review

2.1. Introduction

2.2. Research and Findings

  • Statistical Metrics : Analytics involving the collection of statistical data such as goals scored, goals assisted, passes completed, etc., which reflects a player’s overall contribution to the team’s performance.
  • Position-Specific Analysis : Analytics estimating the effectiveness of various players across different areas of the playing field, allowing them to assess their strong and weak areas, which can highlight potential improvements.
  • Physical performance : Data related to player physical attributes, such as sprint speed, distance covered, and high-intensity runs, help gauge a player’s fitness level and work rate during matches.
  • Video Analysis : In addition to statistical data, video analysis is used to evaluate a player’s decision-making, movement, positioning, and technical skills during matches.
  • Injury Prevention : Understanding the relationship between workload and injury is crucial for developing effective injury prevention strategies. By identifying workload thresholds and patterns associated with higher injury risks, clubs and medical staff can implement targeted measures to reduce the likelihood of injuries.
  • Performance Optimization : The correlation between workload and performance is a critical aspect of player management. Balancing the right level of workload can positively impact player performance, ensuring optimal physical and technical abilities on the pitch.
  • Player Management : A deep understanding of the workload-injury-performance relationship allows for better player management.

3. Methodology

3.1. Objective

3.2. Dataset

  • Personal Information : Player Name, Club Name, Age, Position etc.
  • Individual Workload Features : Minutes Played, Matches Played, Matches Started, and 90 Minutes Played.
  • Individual Performance Features : Goals Scored, Assists, Goals Plus Assists etc.
  • Individual Injury Occurrences : Injured (Injured or Not Injured), Injury Reason, Injury Occurrences etc.

3.3. Methods and Structure

3.4. Data Preprocessing

3.4.1. Feature Engineering

Players were grouped into categorical bands based on starts, minutes played, and age (a code sketch follows the list):

  • Less than 10 ⇒ Rare starter
  • More than 10 but less than or equal to 25 ⇒ Average starter
  • More than 25 ⇒ Frequent starter
  • Less than 1140 min ⇒ Sporadic Player
  • Between 1140 and 2280 (inclusively) minutes ⇒ Squad Rotation Player
  • More than 2280 min ⇒ Crucial Player
  • 16 and 23 ⇒ Youngster
  • 24 and 31 ⇒ Prime
  • 32 and above ⇒ Veteran
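The paper's preprocessing code is not shown, so the following is only a plausible pandas sketch of these three binnings; the column names and the handling of the boundary value of exactly 10 starts (which the text leaves ambiguous) are assumptions:

```python
import pandas as pd

# Assumed schema: three illustrative players with workload features.
players = pd.DataFrame({
    "starts":  [4, 18, 30],
    "minutes": [800, 1500, 2900],
    "age":     [19, 27, 33],
})

# Starts: <10 rare, 10-25 average, >25 frequent (10 assumed to fall in the first band).
players["starter_type"] = pd.cut(
    players["starts"], bins=[-1, 10, 25, float("inf")],
    labels=["Rare starter", "Average starter", "Frequent starter"])

# Minutes: <1140 sporadic, 1140-2280 squad rotation, >2280 crucial.
players["player_usage"] = pd.cut(
    players["minutes"], bins=[-1, 1139, 2280, float("inf")],
    labels=["Sporadic Player", "Squad Rotation Player", "Crucial Player"])

# Age: 16-23 youngster, 24-31 prime, 32 and above veteran.
players["age_category"] = pd.cut(
    players["age"], bins=[15, 23, 31, float("inf")],
    labels=["Youngster", "Prime", "Veteran"])

print(players)
```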

3.4.2. Dummy Variables

3.4.3. Train-Test Strategy

3.4.4. Sampling Methods

3.5. Descriptive Statistics

3.6. Correlation Matrix

3.7. Machine Learning (ML)

3.7.1. Naive Model

3.7.2. Decision Tree

3.7.3. Random Forests

3.7.4. K-Nearest Neighbors (KNN)

3.7.5. Gradient Boosting

3.7.6. Ridge Regression

3.7.7. XGBoost

3.8. Hyperparameter Tuning

3.9. Evaluation Metric

  • In the case of numerical ‘Goals and Assists’, RMSE is an ideal choice since it estimates the average size of forecast errors. It gives a comprehensive assessment of how well the ML model’s predictions match the actual numerical outcomes. Because RMSE prioritizes avoiding both overestimation and underestimation, it is appropriate for evaluating the prediction accuracy of models attempting to estimate continuous variables such as goals and assists.
  • Accuracy is a suitable choice for the categorical ‘Injured’ and ‘Not Injured’ outcomes since it represents the proportion of accurately predicted cases. This metric is critical when it comes to appropriately classifying the occurrence or non-occurrence of an event. Given the significance of accurately detecting injuries, accuracy clearly indicates the model’s ability to categorize these occurrences. A short sketch of both metrics follows this list.
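A minimal plain-Python sketch of the two metrics (the numbers are illustrative, not the paper's results):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: average size of forecast errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def accuracy(actual, predicted):
    """Fraction of cases predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

goals_actual, goals_pred = [3, 7, 12, 0], [4, 6, 10, 1]
injury_actual = ["Injured", "Not Injured", "Not Injured", "Injured"]
injury_pred = ["Injured", "Not Injured", "Injured", "Injured"]

print(round(rmse(goals_actual, goals_pred), 3))  # 1.323
print(accuracy(injury_actual, injury_pred))      # 0.75
```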

4.1. Correlation Analysis

4.2. Predicting Goals and Assists

4.2.1. Model Performance Assessment

Residual Analysis

Training and Testing Assessment

  • Decision Tree versus Random Forests : The Random Forest Regressor demonstrates superior initial performance, evidenced by its lower RMSE in both the training and testing sets, suggesting that it is a more effective model straight away when compared to the Decision Tree Regressor. Its smaller discrepancy between training and cross-validation RMSEs signals a reduced tendency toward overfitting, an advantage over the Decision Tree. For both models, adding more data improves the model’s performance, but the rate of improvement slows down, which is typical as a model starts to reach its performance limit with the given features and model complexity. Notably, the Random Forest Regressor shows less variability in its testing RMSE, which is illustrated by a narrower confidence interval, indicating a more consistent performance regardless of the training set it encountered.
  • Gradient Boosting versus Ridge Regression: The Gradient Boosting model has a considerable and continuous gap between training and testing RMSE. The training RMSE drops significantly at first, demonstrating the model’s ability to fit the training data effectively. However, the testing RMSE improves more slowly, constantly remaining higher than the training RMSE. This difference indicates overfitting, as the model struggles to generalize to new data. The Ridge Regression learning curve, on the other hand, begins with a significant gap between training and testing RMSE. The gap narrows dramatically as training progresses, and the two RMSE curves converge. The training RMSE gradually rises while the testing RMSE falls, indicating better generalization. Based on this behavior, Ridge Regression finds a better balance between fitting the training data and generalizing to new data. While its RMSE values are not the lowest among the models, the model’s ability to resist overfitting is a significant benefit. A brief cross-validation sketch in this spirit follows the list.
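The paper's modelling code is not provided; the sketch below, on synthetic stand-in data, shows the general pattern of comparing a Decision Tree and a Random Forest by cross-validated RMSE with scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data: rows = players, columns = workload features
# (e.g. matches played, minutes, starts); target = goals + assists.
rng = np.random.default_rng(0)
X = rng.uniform(0, 38, size=(200, 3))
y = 0.5 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    # Cross-validated RMSE on the training set (negated sklearn scoring).
    scores = -cross_val_score(model, X_train, y_train,
                              scoring="neg_root_mean_squared_error", cv=5)
    print(type(model).__name__, "CV RMSE:", round(scores.mean(), 2))
```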

4.2.2. Model Performance Comparison Based on RMSE

4.2.3. Feature Importance

4.3. Predicting Injuries

4.3.1. Model Evaluation

  • Injury measurements are not relevant for the Naive Model because it is used exclusively as a baseline for goals/assists prediction.
  • Since Gradient Boosting and Ridge Regression are suited to continuous outcomes, they were applied only to goals-and-assists prediction; injury metrics are therefore not relevant for these models.
  • As XGBoost and K-Nearest Neighbors are well suited to classification tasks, they were used specifically for injury prediction; RMSE values for goals and assists are therefore not relevant.
  • Decision Tree and Random Forest can serve as comprehensive models, but they may sometimes underperform on some metrics.

4.3.2. Feature Importance

5. Discussion and Findings

5.1. Predictors for Player Performance (Goals and Assists)

5.1.1. Matches Played

  • Strong Positive Correlations with Workload Metrics : It became clear that player performance metrics, particularly goals and assists, exhibit strong positive correlations with workload metrics, including elements like minutes played, matches played, and 90s (minutes played divided by 90). This result is consistent with common sense because more time spent on the field naturally gives players more chances to assist their teammates and score goals.
  • Matches Played (MP) with emphasis : Within the subgroup of match load measures, matches played (MP) stood out as especially significant. Compared to other measures like minutes played (Min) and 90s (minutes played per match), which had correlations of 0.23, MP had a higher correlation coefficient of 0.34. This suggests that a player’s ability to score goals and provide assists is most significantly influenced by the sheer number of games in which they take part, regardless of how much time they spend on the field or whether they are a starter.

5.1.2. Squad

  • The prominence of Top Teams : Players from the season’s top four teams (Manchester City, Manchester United, Liverpool, and Leicester City) are marked in bold. This distinction is critical because it emphasizes that being a part of a high-performing team has a major impact on a player’s ability to accumulate goals and assists. In this case, these four teams account for half of the top 20 performers (10 out of 20, as emphasized in bold in Table 8 ), demonstrating their dominance in player performance metrics.
  • Variety of Positions : The table includes players from numerous positions, with forwards contributing the most goals and assists. Midfielders like Kevin De Bruyne, Bruno Fernandez, and Jack Harrison are also prominently featured in the top 20, highlighting their versatile responsibilities in both scoring and creating goals.
  • Individual Brilliance : The list features some of the league’s most prolific goal scorers and playmakers, including Harry Kane and Bruno Fernandes, at the top. These players are recognized for their extraordinary abilities and consistent impact on matches, routinely scoring goals and making assists.
  • Balance and Competitiveness : The presence of players such as Patrick Bamford and Ollie Watkins, who play for clubs other than the conventional top tier, demonstrates that the Premier League retains a competitive and diverse player environment. This type of balance brings interest to the league by allowing developing talent to flourish.

5.1.3. Age Category

5.2. Predicting Injuries

5.2.1. Squad

5.2.2. Average Minutes per Match

6. Conclusions

6.1. Key Findings

6.2. Contribution to the Field of Sports Science and Analytical Research

6.3. Impact on Football Clubs

6.4. Applicability to Other Domains

6.5. Limitations

6.6. Recommendations for Future Work

Author Contributions

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

  • Clemente, F.M.; Martins, F.M.L.; Mendes, R.S.; Figueiredo, A.J. A systemic overview of football game: The principles behind the game. J. Hum. Sport Exerc. 2015, 9, 656–667.
  • Asif, R.; Zaheer, M.T.; Haque, S.I.; Hasan, M.A. Football (soccer) analytics: A case study on the availability and limitations of data for football analytics research. Int. J. Comput. Sci. Inf. Secur. 2016, 14, 516.
  • Chazan-Pantzalis, V.; Tjortjis, C. Sports Analytics for Football League Table and Player Performance Prediction. In Proceedings of the 2020 11th International Conference on Information, Intelligence, Systems and Applications, Piraeus, Greece, 15–17 July 2020; pp. 1–8.
  • Rodrigues, F.; Pinto, Â. Prediction of football match results with Machine Learning. Procedia Comput. Sci. 2022, 204, 463–470.
  • Seidenschwarz, P.; Rumo, M.; Probst, L.; Schuldt, H. A Flexible Approach to Football Analytics: Assessment, Modeling and Implementation. In Proceedings of the 12th International Symposium on Computer Science in Sport (IACSS 2019); Springer International Publishing: Cham, Switzerland, 2020.
  • Windt, J.; Gabbett, T.J. How do training and competition workloads relate to injury? The workload-injury aetiology model. Br. J. Sports Med. 2017, 51, 428–435.
  • Cefis, M.; Carpita, M. Football Analytics: Performance analysis differentiate by role. In Third International Conference on Data Science & Social Research Book of Abstracts; CIRPAS and University of Bari Aldo Moro: Bari, Italy, 2020; p. 22.
  • Javed, D.; Jhanjhi, N.Z.; Khan, N.A. Football Analytics for Goal Prediction to Assess Player Performance. In Proceedings of Innovation and Technology in Sports; Springer Nature: Singapore, 2023; pp. 245–257.
  • Mead, J.; O’Hare, A.; McMenemy, P. Expected goals in football: Improving model performance and demonstrating value. PLoS ONE 2023, 18, e0282295.
  • Baboota, R.; Kaur, H. Predictive analysis and modelling football results using machine learning approach for English Premier League. Int. J. Forecast. 2019, 35, 741–755.
  • Gronwald, T.; Klein, C.; Hoenig, T.; Pietzonka, M.; Bloch, H.; Edouard, P.; Hollander, K. Hamstring injury patterns in professional male football (soccer): A systematic video analysis of 52 cases. Br. J. Sports Med. 2021, 56, 165–171.
  • Howle, K.; Waterson, A.; Duffield, R. Injury Incidence and Workloads during congested Schedules in Football. Int. J. Sports Med. 2019, 41, 75–81.
  • Sarlis, V.; Tjortjis, C. Sports Analytics: Data Mining to Uncover NBA Player Position, Age, and Injury Impact on Performance and Economics. Information 2024, 15, 242.
  • Alayón, S.; Hernández, J.; Fumero, F.J.; Sigut, J.F.; Díaz-Alemán, T. Comparison of the Performance of Convolutional Neural Networks and Vision Transformer-Based Systems for Automated Glaucoma Detection with Eye Fundus Images. Appl. Sci. 2023, 13, 12722.
  • Xu, M.; Watanachaturaporn, P.; Varshney, P.K.; Arora, M.K. Decision tree regression for soft classification of remote sensing data. Remote Sens. Environ. 2005, 97, 322–336.
  • Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22.
  • Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN Model-Based Approach in Classification; Springer: Cham, Switzerland, 2003; pp. 986–996.
  • Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  • de Vlaming, R.; Groenen, P.J. The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics. Biomed. Res. Int. 2015, 2015, 143712.
  • Abdurrahman, G.; Sintawati, M. Implementation of xgboost for classification of parkinson’s disease. J. Phys. Conf. Ser. 2020, 1538, 012024.
  • Belete, D.M.; Huchaiah, M.D. Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results. Int. J. Comput. Appl. 2022, 44, 875–886.
  • McKeown, G. To Build a Top Performing Team, Ask for 85% Effort. Available online: https://hbr.org/2023/06/to-build-a-top-performing-team-ask-for-85-effort (accessed on 14 August 2024).


Study Title | Author(s) | Category | Aim | Year
How do training and competition workloads relate to injury? | Windt and Gabbett | Workload, Injuries and Performance in Football | Present a new framework for managing football injuries | 2017
Injury Incidence and Workloads during congested Schedules in Football | Howle et al. | Workload, Injuries and Performance in Football | Examine the relationship between injury incidence and workloads in congested football schedules | 2019
Predictive analysis and modelling football results using machine learning approach for English Premier League | Baboota and Kaur | Performance Analysis in Football | Use machine learning for predictive analysis of football match outcomes | 2019
Football Analytics: Performance analysis differentiate by role | Cefis and Carpita | Performance Analysis in Football | Develop composite indices for performance assessment in football | 2020
Hamstring injury patterns in professional male football | Gronwald et al. | Workload, Injuries and Performance in Football | Identify factors contributing to acute hamstring injuries in professional male football players | 2021
Expected goals in football: Improving model performance and demonstrating value | Mead et al. | Performance Analysis in Football | Improve Expected Goals (xG) modeling for assessing team success in football | 2023
Football Analytics for Goal Prediction to Assess Player Performance | Javed et al. | Performance Analysis in Football | Create a Goal Prediction Model and assess player performance in football | 2023
Model | Hyperparameters | RMSE
Naive Model | N/A | 6.62
Decision Tree (DC_1) | Default | 4.77
Decision Tree (DC_2) | Max depth: None; Min samples leaf: 4; Min samples split: 2 | 4.41
Random Forest (RF_1) | Default | 4.33
Random Forest (RF_2) | Max depth: None; Min samples leaf: 2; Min samples split: 5; N estimators: 50 | 4.24
Gradient Boosting (GB_1) | Default | 4.13
Gradient Boosting (GB_2) | Learning rate: 0.1; Max depth: 3; Min samples leaf: 4; Min samples split: 10; N estimators: 50 | 4.05
Ridge Regression (RR_1) | Default | 3.91
Feature | Decision Tree | Random Forest | Gradient Boosting
Minutes Played (Min) | 0.024 | 0.397 | 0.305
Matches Played (MP) | 0.022 | 0.047 | 0.054
Starts | 0.073 | | 0.058
90 Minutes Played (90s) | | 0.114 |
Age | 0.006 | 0.030 | 0.032
Model | Accuracy | Precision | Recall | AUC Score
Decision Tree (DTC_2) | 0.72 | 0.58 | 0.35 | 0.66
Random Forest (RFC_2) | 0.69 | 0.50 | 0.50 | 0.70
XGBoost (XGB_2) | 0.72 | 0.55 | 0.60 | 0.65
K-Nearest Neighbors (KNN_2) | 0.63 | 0.405 | 0.50 | 0.54
Model (Baseline) | RMSE (Goals/Assists) | Accuracy (Injury) | Precision (Injury) | Recall (Injury) | AUC (Injury)
Naive Model | 6.62 | NA | NA | NA | NA
Gradient Boosting | 4.05 | NA | NA | NA | NA
Ridge Regression | 3.90 | NA | NA | NA | NA

Model (Injury investigations) | RMSE (Goals/Assists) | Accuracy (Injury) | Precision (Injury) | Recall (Injury) | AUC (Injury)
XGBoost | NA | 0.72 | 0.55 | 0.60 | 0.65
K-Nearest Neighbors | NA | 0.63 | 0.405 | 0.50 | 0.54

Model (Comprehensive) | RMSE (Goals/Assists) | Accuracy (Injury) | Precision (Injury) | Recall (Injury) | AUC (Injury)
Decision Tree | 4.41 | 0.72 | 0.58 | 0.35 | 0.66
Random Forest | 4.24 | 0.69 | 0.50 | 0.50 | 0.70
Feature | Random Forest (RFC_2) | Decision Tree (DTC_2) | XGBoost (XGB_2)
Position | 0.06 | 0.11 | 0.03
Squad | | |
Age | 0.11 | 0.12 | 0.20
Matches Played | 0.12 | 0.08 | 0.12
Starts | 0.10 | 0.04 | 0.11
Minutes Played | 0.12 | 0.04 | 0.08
Average Minutes/Match | 0.12 | 0.25 | 0.11
90s | 0.16 | 0.09 | 0.07
Player Workload | 0.00 | 0.00 | 0.00
Player Usage | 0.01 | 0.00 | 0.01
Age Category | 0.03 | 0.00 | 0.01
Player | 2020/21 Matches Played | 2020/21 Gls_Ast | 2021/22 Matches Played | 2021/22 Gls_Ast | 2022/23 Matches Played | 2022/23 Gls_Ast
Harry Kane | 35 | 37 | 37 | 26 | 38 | 33
Mohammed Salah | 37 | 27 | 35 | 36 | 38 | 31
Bruno Fernandes | 37 | 30 | 36 | 16 | 37 | 16
Player | Goals + Assists | Team | Position
Harry Kane | 37 | Tottenham | Forward
Bruno Fernandes | 30 | Manchester United | Midfielder
Son Heung-min | 27 | Tottenham | Forward
Mohammed Salah | 27 | Liverpool | Forward
 | 24 | Leicester City | Forward
Patrick Bamford | 24 | Leeds United | Forward
 | 20 | Manchester United | Forward
Ollie Watkins | 19 | Aston Villa | Forward
 | 18 | Manchester City | Midfielder
 | 18 | Liverpool | Forward
Matheus Pereira | 17 | West Brom | Forward
 | 17 | Manchester City | Forward
Callum Wilson | 17 | Newcastle Utd | Forward
 | 16 | Manchester City | Forward
 | 16 | Liverpool | Forward
Jack Harrison | 16 | Leeds United | Midfielder
Danny Ings | 16 | Burnley | Forward
Dominic Calvert-Lewin | 16 | Everton | Forward
 | 15 | Manchester City | Forward
Chris Wood | 15 | Burnley | Forward
Player | Squad | Age
Harry Kane | Tottenham | 27
Mohammed Salah | Liverpool | 28
Bruno Fernandes | Manchester United | 26

Share and Cite

Chang, V.; Sajeev, S.; Xu, Q.A.; Tan, M.; Wang, H. Football Analytics: Assessing the Correlation between Workload, Injury and Performance of Football Players in the English Premier League. Appl. Sci. 2024 , 14 , 7217. https://doi.org/10.3390/app14167217


