Summary Statistics

What is the meaning of summary statistics?

Summary statistics (or summary metrics) describe a complicated data set (or a whole population) with a few simple numbers. They condense large amounts of data into key characteristics such as the center, the spread, the shape of the distribution, and any correlation or dependence between variables.

Examples of summary statistics

Summary statistics usually fall into several broad categories: location, shape, spread, dependence, and order statistics. The first four are listed here with examples; order statistics are covered in the next section.

Summary statistics that measure:

  • Location (average or central tendency) - Where is the data centered? What is a typical value? Examples include the mean, median, and mode.
  • Shape - How is the data distributed? Is it symmetric or skewed, and how heavy are its tails? Examples include skewness, kurtosis, and L-moments.
  • Spread - How varied or dispersed is the data? Examples include range, variance, and standard deviation (among others).
  • Dependence - If the data contains more than one variable, are the variables correlated or dependent? The primary example is the correlation coefficient.
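
As a rough illustration of the four categories above, the following Python sketch computes one or two statistics from each using only the standard library. The sample values and the `summarize` and `pearson` helper names are made up for this example, and the skewness formula is the simple moment-based (population) version, which is only one of several definitions in use.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(values):
    """Return a few location, spread, and shape statistics as a dict."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second and third central moments, used for moment-based skewness.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "mean": mean,                                # location
        "median": statistics.median(values),         # location
        "mode": statistics.mode(values),             # location
        "range": max(values) - min(values),          # spread
        "variance": statistics.variance(values),     # spread (sample, n - 1)
        "stdev": statistics.stdev(values),           # spread (sample, n - 1)
        "skewness": m3 / m2 ** 1.5,                  # shape
    }


# Hypothetical sample data, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]
paired = [3, 5, 4, 6, 9, 10, 12]  # hypothetical second variable

stats = summarize(values)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")

# Dependence: how strongly the two variables move together.
print(f"correlation: {pearson(values, paired):.3f}")
```

A positive skewness here reflects the longer tail toward larger values; a correlation close to 1 means the two hypothetical variables rise together almost linearly.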

Which summary statistics to use

Deciding which summary statistics to use depends on which questions you need answered and/or the problem you’re trying to solve. Often before even looking at the data, it’s useful to articulate exactly what your goal or problem is. Check out this no-nonsense data analysis guide for help defining your problem.

There’s also another set of summary statistics - order statistics - that covers several of the above-mentioned categories at once (e.g. average, shape, spread). The two most common summaries built from order statistics are the five-number summary and the slightly expanded seven-number summary. As the names indicate, they consist of 5 and 7 specific numbers, respectively, that help describe the entire data set; the five-number summary, for example, is the minimum, lower quartile, median, upper quartile, and maximum. The advantage of using either of these summaries is that you don’t have to decide which numbers (e.g. mode, median, skewness, L-moments, variance) to include, because they’re already defined.
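
A minimal sketch of both summaries in Python follows. It assumes the "inclusive" quartile method of the standard library's `statistics.quantiles`; other quartile conventions give slightly different values. The seven-number variant shown here adds the first and last deciles to the five-number summary, which is one convention among several. The function names and sample data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def seven_number_summary(values):
    """Five-number summary plus the first and last deciles.

    Assumption: this decile-based variant is only one of several
    definitions of a seven-number summary.
    """
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    lo, q1, q2, q3, hi = five_number_summary(values)
    return (lo, deciles[0], q1, q2, q3, deciles[-1], hi)


# Hypothetical sample data, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

print(five_number_summary(values))
print(seven_number_summary(values))
```

Because every number in these summaries is a position in the sorted data, they are less sensitive to a single extreme value than the mean or variance, apart from the minimum and maximum themselves.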

Danger of summary statistics

It’s helpful to plot the data itself rather than rely on summary statistics alone, because data sets can share the same summary metrics and yet look completely different when graphed.

To demonstrate this effect, statistician Francis Anscombe published four example data sets in 1973. Known as Anscombe’s Quartet, the four data sets have nearly identical means, variances, and correlations. However, when graphed, it’s clear that each of the data sets is totally different. The point Anscombe wanted to make is that the shape of the data is as important as the summary metrics and cannot be ignored in analysis.
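
The numerical side of this can be checked with a short Python sketch. The values below are the quartet as it is commonly reproduced; the `pearson` and `describe` helper names are made up for this example, and the variances are sample variances (dividing by n - 1).

```python
import math
import statistics

# Anscombe's quartet. Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def describe(xs, ys):
    """Mean and sample variance of each variable, plus their correlation."""
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "var_x": statistics.variance(xs),
        "var_y": statistics.variance(ys),
        "corr": pearson(xs, ys),
    }


results = {name: describe(xs, ys) for name, (xs, ys) in quartet.items()}
for name, d in results.items():
    print(name, {key: round(value, 2) for key, value in d.items()})
```

Rounded to two decimals, the four rows print almost the same numbers, yet a scatter plot of each set shows a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one far-off point. Plotting is left out here to keep the sketch dependency-free.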

Additional resources to learn more about summary statistics