Data science glossary

Does complex terminology put you off using data? Here’s a glossary of common data science terms with straightforward definitions, helpful examples, and additional resources if you want to dive deeper.

What is an algorithm?

An algorithm is defined as a specified process for solving a problem - often written by a human and performed by a computer. It’s like a recipe with exact steps that a computer follows to produce the same result every time - such as alphabetizing an event attendee list by last name.

Algorithms are especially useful for performing repetitive and complex calculations, processing large amounts of data, and completing automated reasoning tasks. Algorithms make processes more efficient.

Types of Algorithms

There are many different types of algorithms (it could easily become a rabbit hole of indefinite learning!), but here are a few of the more common types you might hear or encounter.

Algorithm Examples

Algorithms are nearly everywhere in the digital world - from Google’s PageRank to Netflix recommendations to ecommerce checkout pages. In each instance, these algorithms are crunching data to deliver relevance to you.

If you use a spreadsheet to combine, separate, or otherwise order data, you’re using an algorithm to complete those tasks. As the names imply, Merge Sort, Quick Sort, and Heap Sort are sorting algorithms used to arrange and rearrange data in various ways to provide more useful insights.
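
To see a sorting algorithm in action without a spreadsheet, here’s a minimal Python sketch that alphabetizes an event attendee list by last name - the same task mentioned above. The names are made up for illustration; Python’s built-in sorted() happens to use Timsort, a hybrid of merge sort and insertion sort.

  # A minimal sketch: alphabetize an event attendee list by last name.
  # The names below are made-up placeholders for illustration.
  attendees = ["Dana Smith", "Alex Jones", "Casey Brown", "Morgan Adams"]

  # The key function tells sorted() to compare attendees by their last name.
  by_last_name = sorted(attendees, key=lambda name: name.split()[-1])

  print(by_last_name)
  # ['Morgan Adams', 'Casey Brown', 'Alex Jones', 'Dana Smith']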

Link Analysis is a type of graph algorithm that maps the relationship between data points. A prime example of what Link Analysis can produce is your Facebook Newsfeed. Link Analysis works well for discovering new, related content, matches in data for known patterns, and anomalies where known patterns are violated.

Additional Resources to Learn More About Algorithms

What is data analytics?

A broad definition of analytics is the review of data to discover, understand, and communicate meaningful patterns. Or more simply, analytics is the process of turning raw data into useful insights. Analytics can also refer to program- or product-specific insights such as Google Analytics, Facebook Analytics, Twitter Analytics, etc.

But there are four broad types of analytics that can be applied to generally any data you’re working with. These types are descriptive, diagnostic, predictive, and prescriptive analytics. (The most common are descriptive, predictive, and prescriptive.) Each one builds on the previous one.

What is the difference between descriptive, diagnostic, predictive, and prescriptive analytics?

Descriptive analytics defined

Descriptive analytics, as the name implies, describes what has happened in the past, which is sometimes referred to as ‘historical data.’ Descriptive analytics answers the question “What has happened?”

Descriptive analytics examples

A vast number of analytics fall into this category. Examples of descriptive analytics include number of sales last year, variance in churn rate month over month, average revenue per customer, etc. Basically, any instance where raw data from the past (1 minute or 1 year ago) is summarized can be classified as descriptive analytics.
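
Here’s a minimal Python sketch of descriptive analytics in action - summarizing a year of made-up monthly sales figures into a few simple numbers about the past.

  from statistics import mean

  # Made-up monthly sales counts for last year, for illustration only.
  monthly_sales = [120, 135, 128, 150, 160, 155, 170, 165, 180, 175, 190, 200]

  print(sum(monthly_sales))                       # total sales last year
  print(round(mean(monthly_sales), 1))            # average sales per month
  print(max(monthly_sales) - min(monthly_sales))  # gap between best and worst months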

Diagnostic analytics defined

Diagnostic analytics focuses on the underlying cause and is less common than the other three types of analytics. It drills down to a single issue or problem in isolation. Diagnostic analytics answers the question “Why is this happening?”

Diagnostic analytics examples

Some diagnostic analytics examples include a marketing manager reviewing campaign performance across different geographic regions, a sales director analyzing the number of sales for each product, or a customer success team looking at the ticket response times experienced by customers who churn.

Predictive analytics defined

Predictive analytics is a calculated guess of what may happen in the future based on past performance. While it can’t actually predict the future, it creates a forecast using algorithms that factor in past performance (descriptive analytics) and other possible variables. Predictive models create a picture of the future that can then be used to make more informed, data-backed decisions (see prescriptive analytics below). Predictive analytics answers the question “What is likely to happen?”

Predictive analytics examples

Some common examples of predictive analytics are budgets that forecast expenses for the upcoming year, a credit score that estimates the likelihood of someone making on-time payments in the future, and revenue predictions that help executives understand how much profit might be possible for the business.

Prescriptive analytics defined

Prescriptive analytics takes what’s likely to happen (predictive analytics) and suggests strategies or actions moving forward. It’s advice for achieving a possible outcome. Prescriptive analytics answers the question “What should we do?”

Prescriptive analytics examples

Prescriptive analytics examples include optimizing supply chain management based on demand trends (predictive analytics), suggesting the fastest route home based on driving conditions, or planning employee work shifts based on the busiest times for a restaurant or retailer.

What’s the difference between analytics and analysis?

There’s an interesting distinction between analytics and analysis. Analytics focuses on the entire methodology (i.e. the tools and techniques) for obtaining useful insights from data. Data analysis is a subset of that methodology focused on compiling and reviewing data to aid in decision making.

Additional resources to learn more about descriptive, diagnostic, predictive, and prescriptive analytics

What is bias in data analytics?

In general, bias can be defined as an inclination towards a thing, group, or person, often in a way that is inaccurate or unfair. Bias in statistics can affect your insights, leading to poor, and potentially costly, business decisions.

While there are many types of bias, here are some common biases in data analysis you’ll want to look out for.

  • Anchoring Bias: Relying heavily on the first piece of data encountered to make a decision (treating subsequent information as less significant).
  • Publication Bias: How interesting a research finding is affects how likely it is to be published, distorting our impression of reality.
  • Sampling Bias: Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand. This is a type of Selection Bias.
  • Confirmation Bias: Looking for information that confirms what you already think or believe. This is similar to another common data fallacy - Cherry Picking.
  • Survivorship Bias: Drawing conclusions from an incomplete set of data, because that data has ‘survived’ some selection criteria.
  • Funding Bias (sometimes called Sponsorship Bias): Favoring the interests of the person or group funding the research or analysis. Data might be selected or ignored to validate a predetermined conclusion.
  • Observation Bias (also known as the Hawthorne Effect): When the act of monitoring someone can affect that person’s behavior.

Biased data examples

Understanding the biases that can impact data analysis is a great first step to recognizing them when they pop up in your own data analysis. Here are some common examples of statistical bias you may notice as you work with data.

In salary negotiations, applicants and hiring managers alike might rely on the first salary rate mentioned as the basis for a reasonable range. This anchoring ignores other rates that might also be “reasonable” based on location, experience, job description, etc.

A customer success manager might want to understand why churn has been increasing month over month and has a hunch it’s because of a product feature that adds frustration to the user experience. When reviewing exit surveys, the customer success manager notices this feature is mentioned many times, but ignores that average ticket response time is mentioned just as often. This example of confirmation bias shows how our preconceived understanding of the data distorts reality.

A marketing team wants to know which channel their audience prefers for consuming content. They conduct a Twitter poll since that’s their most responsive channel, not realizing that a significant portion of their audience (on Quora and LinkedIn) never saw the poll. While this example of sampling bias might seem obvious, it’s easy to focus on gathering data quickly and miss this common tendency.

Additional resources to learn more about bias in data analysis

What is correlation?

A simple definition of correlation is the relationship between two or more variables (or data sets). This relationship is defined more specifically by its strength and direction.

What is strong correlation in statistics?

A strong correlation (sometimes referred to as high correlation) is when two groups of data are very closely related. The inverse is also true - weak (or low) correlation means the two groups of data are only somewhat related.

For example, increasing ice cream sales have a strong (high) correlation with rising temperatures. The hotter it is, the more ice cream people eat.

What is positive and negative correlation?

The direction of related variables can be either positive or negative. Positive correlation means both values increase together. Negative correlation means one value increases while the other value decreases.

Continuing with the ice cream example, higher ice cream sales have a strong positive correlation with warmer temperatures because as the weather gets hotter (increasing value) more ice cream is sold (increasing value).

One negative correlation might be that less hot chocolate is sold (decreasing value) when the temperature gets warmer (increasing value). Or consider another example of negative correlation - the more someone pays on their mortgage (increasing value), the less they owe (decreasing value).

(The more technical term for describing both the strength and direction of a correlation is the correlation coefficient.)
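
If you’re curious how a correlation coefficient is calculated in practice, here’s a minimal Python sketch using made-up temperature and ice cream sales figures. The numbers are invented for illustration; numpy’s corrcoef function returns the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).

  import numpy as np

  # Made-up data for illustration: daily high temperature and ice cream sales.
  temperature = [62, 68, 75, 81, 88, 94]
  ice_cream_sales = [110, 135, 160, 205, 240, 270]

  # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
  # is the Pearson correlation coefficient between the two variables.
  r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
  print(round(r, 2))  # close to +1, i.e. a strong positive correlation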

Correlation vs causation: what’s the difference?

If you’ve heard the mantra “Correlation does not imply causation” you might have wondered - what’s the difference between correlation and causation and why does it matter?

As described above, correlation indicates a relationship between two variables and is particularly helpful for making predictions. For example, if we know that SAT (Scholastic Assessment Test) scores have a strong positive correlation with a student’s grade point average (GPA) in college, we can assume both SAT scores and GPA will continue having a strong positive correlation in the future. So based on a student’s SAT scores in high school, we can predict what their GPA might be in college.

In contrast, causation refers to cause and effect - where one variable causes the other variable. To use an obvious example, we might notice that global temperatures have steadily risen over the past 150 years and the number of pirates has declined at a comparable rate (negative correlation). No one would reasonably claim that the reduction in pirates caused global warming or that more pirates would reverse it. But if we look at other contributing factors, we see the cause of both is industrialization.

It’s easy to assume that because two events happen at the same time (correlate), one must cause the other. This data fallacy is called False Causality. Remember that correlation alone does not prove a cause and effect relationship.

Additional resources to learn more about correlation

What is data processing?

A simple definition of data processing is a sequence of steps for collecting raw data and turning it into meaningful information. This sequence is usually carried out by a computer, allowing us to quickly gain insight from large amounts of data.

Data processing steps

There are several steps (also functions or tasks) that may be used to process data depending on what raw data you have and what you need to know from it.

  • Validation: ensures data is relevant and correct.
  • Sorting: organizes data into a sequence and/or in particular sets.
  • Aggregation: brings multiple pieces of data together.
  • Summarization: reduces detailed data into key points.
  • Analysis: discovers, interprets, and communicates meaningful patterns.
  • Reporting: presents data.
  • Classification: divides data into different groups.
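
Here’s a minimal Python sketch that strings several of these steps together - validating, sorting, aggregating, and summarizing a handful of made-up sales records (the records and field names are invented for illustration).

  # Made-up raw records; in practice these might come from a file or API.
  raw_sales = [
      {"region": "North", "amount": 120.0},
      {"region": "South", "amount": 95.5},
      {"region": "North", "amount": -10.0},   # invalid: negative amount
      {"region": "South", "amount": 210.25},
  ]

  # Validation: keep only records with a positive amount.
  valid = [row for row in raw_sales if row["amount"] > 0]

  # Sorting: order records by amount, largest first.
  valid.sort(key=lambda row: row["amount"], reverse=True)

  # Aggregation: total the amounts for each region.
  totals = {}
  for row in valid:
      totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]

  # Summarization / reporting: reduce the detail to a few key points.
  print(totals)                # {'South': 305.75, 'North': 120.0}
  print(sum(totals.values()))  # 425.75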

Types of data processing

A computer can process data in different ways depending on amount of data, time requirements, computing power and availability, etc. Some of these types of data processing can become quite technical, but here’s a simple overview.

  • Batch processing: As the name implies, this type of data processing takes a chunk of data (in sequential order), processes it, and once it’s all complete, returns the insights for that chunk of data. This type helps reduce processing costs for large amounts of data.
  • Real-time or online processing: Probably the most familiar, this type of processing simultaneously receives and processes the data, providing immediate results. This requires an internet connection and the data is stored online.
  • Multiprocessing: This efficiently utilizes two or more independent computer brains (the technical term is central processing unit, or CPU) to process data simultaneously. The processing tasks are divided between whichever CPUs are currently available to reduce processing time and maximize throughput (see the sketch after this list).
  • Time sharing: This type is where several users rely on a single CPU to process data. Users share the processing time and therefore have allocated time slots for processing data. Sometimes this type is called a multi-access system.
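
As a minimal sketch of multiprocessing, the Python example below uses the standard multiprocessing module to divide a made-up workload (totaling chunks of numbers) across the available CPUs.

  from multiprocessing import Pool

  def process_chunk(numbers):
      # A stand-in task: total one chunk of data.
      return sum(numbers)

  if __name__ == "__main__":
      # Made-up data split into four chunks for illustration.
      chunks = [range(0, 250_000), range(250_000, 500_000),
                range(500_000, 750_000), range(750_000, 1_000_000)]

      # Pool divides the chunks among the available CPUs and
      # processes them simultaneously.
      with Pool() as pool:
          partial_totals = pool.map(process_chunk, chunks)

      # Same answer as processing everything on one CPU, just faster.
      print(sum(partial_totals))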

Examples of data processing

Examples of data processing can be seen in many common activities - some we may take for granted such as paying with a credit card or taking a picture on a phone. For the latter, the camera lens captures the raw data (color, light, etc.) and converts it into a photo file that can be easily edited, shared, or printed.

Transactions - whether paying by credit card or transferring money internationally - also require data processing to collect, verify, and format the payment credentials before the bank or other financial institution can accept it.

Another example is the automated billing for a Software-as-a-Service (SaaS) business. The computer summarizes the charges for each customer’s service plan and converts the charges to a monthly or annual invoice that is billed automatically.

Self-driving vehicles are a somewhat less common example, but illustrate a significant array of processed data. The sensors around the car provide tons of raw data about navigation, other vehicles and pedestrians, driving conditions, traffic signals, street signage, and more. All of that data is then processed in real-time to dictate when to go, stop, turn, change lanes, accelerate, signal, etc.

Additional resources to learn more about data processing

What is a data set?

A data set is a technical term that simply means a collection of data. Typically, a data set refers to the content of a single table or graph. More specifically, a data set contains a single time series such as number of customer service tickets resolved daily.

This term can be a little confusing since some people use it more generally as a reference to related tables. A more precise term for related tables is data collections (see below).

Data set examples

A data set can be anything from the trend of trial signups over the last month to the geographic location of customers to the value of bitcoin year-over-year. Usually, datasets contain a single time series. For example, you might have a dataset with the number of sales for every day this week or a dataset with monthly revenue churn.

What’s the difference between datasets, databases, and data collections?

Datasets refer to data with a single time series.

Databases are made of data on a particular topic from a single publisher and may contain many datasets. (Some people may use databases more loosely to refer to a group of datasets in one location, even if the datasets are compiled from different sources.)

Data collections are made up of related datasets or databases on a single topic.

Big data and open data

For additional context, when the size or complexity of a dataset exceeds the capacity of normal data processing applications, it’s called big data.

Datasets that are aggregated and then shared in a public repository are referred to as open data.

Additional resources to learn more about datasets

What is a hypothesis?

A hypothesis is an educated guess (often about the cause of a problem) that hasn’t been confirmed yet. Think of it as a possible explanation that needs to be tested.

A scientific hypothesis refers to a hypothesis that will be proven using the scientific method (a series of steps to investigate a claim using measurable evidence). A key element of a scientific hypothesis is that it can be proven wrong (meaning it’s falsifiable).

How do you set up a hypothesis?

There are a few characteristics of a good hypothesis:

  • it involves an independent variable and a dependent variable
  • it’s testable
  • it’s falsifiable

The independent variable is the cause (the aspect that can be changed or controlled) and the dependent variable is the effect (the testable outcome).

A hypothesis is usually written as a statement describing a possible explanation that connects those two variables (see examples below).

A helpful way of ensuring you have a falsifiable hypothesis is to drop your variables into this question: “If [independent variable/cause] occurs, will [dependent variable/effect] be true or false?”

What is an example of a hypothesis?

Some examples of hypotheses might be:

  • A simplified form generates more trial signups than a detailed form
  • Pricing package A is more popular with customers than pricing package B
  • The holidays negatively impact our weekly average website sessions

What is a null hypothesis?

There are several different types of hypotheses (e.g. simple, complex, statistical, empirical), but an important type to know is the null hypothesis. This type of hypothesis states that there is no significant relationship between the two variables. The symbol for a null hypothesis is H₀ (read as “H-zero” or “H-naught”).

Basically, a null hypothesis claims the opposite of a typical hypothesis. The purpose of a null hypothesis is to give the experiment something to disprove: if the results contradict (reject) the null hypothesis, that supports the claim that there is in fact a relationship between the two variables.

Null hypothesis example

The following null hypotheses are the inverse of the hypotheses mentioned above.

  • On average, there is no difference in the number of signups generated between the simplified form and the detailed form
  • Pricing packages A and B are equally popular with customers
  • The holidays make no significant impact on our weekly average website sessions
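
Taking the first null hypothesis above, here’s a minimal Python sketch of how you might test it. The visitor and signup counts are made up for illustration; the code runs a two-proportion z-test by hand and converts the z-score to a p-value with the normal distribution. A small p-value (commonly below 0.05) would lead you to reject the null hypothesis and conclude the two forms really do perform differently.

  import math

  # Made-up results for illustration: visitors and signups for each form.
  simple_visitors, simple_signups = 1000, 130
  detailed_visitors, detailed_signups = 1000, 95

  p1 = simple_signups / simple_visitors
  p2 = detailed_signups / detailed_visitors

  # Pooled signup rate under the null hypothesis (no difference between forms).
  pooled = (simple_signups + detailed_signups) / (simple_visitors + detailed_visitors)
  standard_error = math.sqrt(pooled * (1 - pooled) * (1 / simple_visitors + 1 / detailed_visitors))

  # z-score: how many standard errors apart the two signup rates are.
  z = (p1 - p2) / standard_error

  # Two-sided p-value from the standard normal distribution.
  p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

  print(round(z, 2), round(p_value, 4))
  # A p_value below your chosen threshold (e.g. 0.05) means you reject the null hypothesis.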

Additional resources for learning more about hypotheses

What is margin of error?

A simplified definition of margin of error is the amount the results of a random sampling may differ from the results of surveying the whole.

In many cases, it doesn’t make sense (or it’s not possible) to survey an entire group, so a random sample is chosen instead. The margin of error states to what degree the sample results accurately represent the whole.

The smaller the margin of error, the more accurate the sample results are. The larger the margin of error, the less accurate the sample results may be.

What is a good margin of error?

A ‘good’ margin of error depends on the level of accuracy you need. While a 5% margin of error is fairly common, it can fall anywhere from 1% to 10%. Anything over 10% is not recommended.

The margin of error can typically be increased or decreased by adjusting the sample size of your survey.

Margin of error, confidence level, and confidence interval

Other closely related terms that might be mentioned in the context of margin of error are the ‘confidence level’ and the ‘confidence interval’ of the survey. These terms are easy to get confused, so let’s break it down.

The confidence level is usually articulated as a percent - such as 95% - and states the degree of reliability for a randomized sample survey. It answers the question “how likely is it that I can repeat this survey and get the same results?”

The confidence level can range from 0% (zero confidence in the repeatability of the survey results) up to 100% (although it’s statistically impossible to ever be 100% confident). The higher the confidence level, the more reliable the survey results are.

The margin of error focuses on the range of possible error above or below the result of the survey.

The confidence interval is simply the full range covered by the margin of error. Since the margin of error extends both above and below the survey result, the width of the confidence interval is double the margin of error.

For example, if the result of a random sample shows 60% of customers are very satisfied with your service and you have a margin of error of 3%, you can expect that between 57% and 63% (confidence interval) of all your customers are very satisfied. The confidence level - for this example let’s say it’s 95% - tells you that 95% of the time you’ll get results that fall within 57-63% (confidence interval).
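
Here’s a minimal Python sketch of that same example worked the other way around - starting from a made-up sample size and calculating the margin of error for a proportion at a 95% confidence level (z ≈ 1.96).

  import math

  sample_size = 1000   # made-up number of customers surveyed
  proportion = 0.60    # 60% of the sample said they are very satisfied
  z = 1.96             # z-score for a 95% confidence level

  # Margin of error for a proportion: z * sqrt(p(1-p)/n)
  margin_of_error = z * math.sqrt(proportion * (1 - proportion) / sample_size)

  lower = proportion - margin_of_error
  upper = proportion + margin_of_error
  print(f"{margin_of_error:.1%} margin of error")            # about 3.0%
  print(f"Confidence interval: {lower:.1%} to {upper:.1%}")  # roughly 57% to 63%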

Additional resources for learning more about margin of error

What is a multivariate test?

A multivariate test is a process where variations of different elements can be evaluated simultaneously. Multivariate testing allows you to determine which combination of variations performs best.

Example of a multivariate test

One common use of multivariate testing is evaluating which variations of a website perform best. For example, perhaps you want to increase sign ups on a specific webpage. You could test two different titles, two different images, and two different calls to action.

A total of eight different versions (the maximum combination for these three elements) would be tested simultaneously to determine which version of the webpage produces the most sign ups.

How to calculate the variations in a multivariate test

Calculating the total number of variations in a multivariate test is a straightforward equation.

[# variations for element A] X [# variations for element B]… = # total possible variations

Using the multivariate test example above, the calculation would be 2 X 2 X 2 = 8.
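
Here’s a minimal Python sketch that both counts and lists those combinations (the element names are invented for illustration).

  from itertools import product
  from math import prod

  # Two variations of each element, named for illustration.
  titles = ["Title A", "Title B"]
  images = ["Image A", "Image B"]
  calls_to_action = ["CTA A", "CTA B"]

  # Total possible variations: 2 x 2 x 2 = 8
  total = prod(len(element) for element in (titles, images, calls_to_action))
  print(total)

  # itertools.product enumerates every combination to be tested.
  for combination in product(titles, images, calls_to_action):
      print(combination)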

Multivariate testing vs A/B testing

Multivariate testing allows you to see which variations of different elements perform best together. This can also be called multi-variable testing.

A/B testing compares just two variations - whether overall performance or single elements. For example, you might test a green call to action on the page [test A] against a red call to action on the same page [test B] to see which color gets the most clicks.

Another option with A/B testing is to compare two drastically different pages against each other. Even though there may be many different elements on the two pages, an A/B test only shows the overall performance of each page - not the individual elements. It’s worth noting that additional variations can be tested (i.e. A/B/C testing), but they still only compare the overall performance of each page - unlike multivariate which shows the relationships between varied elements.

Pros and cons of multivariate testing

Multivariate testing is an efficient way of evaluating different elements and possible combinations. This process can save valuable time that otherwise would have been spent on many iterations of A/B testing.

The primary limitation of multivariate testing is the high traffic requirement. The traffic will be evenly divided between all possible variations. So if you have eight possible variations, your traffic will be divided into eighths. The danger is when a webpage doesn’t receive high enough traffic to produce reliable results (i.e. statistical significance).

Also, multivariate testing isn’t applicable for certain types of change. For example, testing all the elements of a rebranded homepage design with the existing design wouldn’t make sense because of the radical variations between them.

In contrast, A/B tests compare the overall performance of each (radical) variation. A/B tests also allocate 50% of traffic to each variation since there are usually only two variations tested at the same time (unless performing A/B/C testing as mentioned above).

Additional resources for learning more about multivariate testing

What is an outlier?

An outlier is defined as a piece of data that is distant from the remaining set of data. Think of it as a straggler. There are a few reasons why you may encounter outliers in your data. They might be caused by a measurement error, they might be evidence of an abnormal distribution of data, or perhaps they indicate a smaller subset of the data.

An outlier may be found using different statistical methods including standard deviation, Peirce’s criterion, and other advanced methods. Many of these can be performed using formulas in a spreadsheet or online calculators (see Additional Resources below for links).

Why is it important to identify outliers in statistics?

It’s important to be aware of outliers in your data because they can skew your analysis, leading to inaccurate or misleading reporting and perhaps poor decisions. One of the most pronounced distortions is when an outlier throws off the mean (average) of the data.

For example, if your customer service reply times for the past 10 tickets are 22, 18, 21, 27, 26, 23, 25, 134, 22, and 23 minutes, your average reply time comes out to about 34 minutes. By removing the outlier (134), the average reply time drops to 23 minutes. This outlier causing the 11-minute difference in average might be from a reporting error or perhaps an abnormal circumstance preventing a member of the customer service team from responding within the average timeframe.
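
Here’s a minimal Python sketch of that reply-time example, showing how much the single outlier shifts the mean.

  from statistics import mean

  # Reply times (minutes) from the example above.
  reply_times = [22, 18, 21, 27, 26, 23, 25, 134, 22, 23]

  print(round(mean(reply_times)))      # about 34 minutes with the outlier included

  without_outlier = [t for t in reply_times if t != 134]
  print(round(mean(without_outlier)))  # 23 minutes once the outlier is removed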

It’s worth noting that outliers shouldn’t automatically be discarded. Taking a second look at the data may help uncover a deeper or different issue causing the outlier.

Additional resources for learning more about outliers

What is probability?

The definition of probability is simply the likelihood that an event will happen. Probability isn’t a guarantee, but rather a guide of what may occur and how likely it is to occur based on the number of possible outcomes.

Probability is measured from 0 (impossibility) to 1 (certainty) and can be shown as a fraction (⅗), decimal (0.6), or percentage (60%).

(Note: although it exceeds the depth of this article, there are four different types of probability - classical, empirical, subjective, and axiomatic. Learn more about the types of probability here.)

Probability examples

The most straightforward example of probability is a coin toss. Since there are only two possible outcomes - heads or tails - each one has a 50% probability of occurring. Another probability example is rolling a die. There’s a one in six (about 16.7%) chance that you will roll a four.

Probability formula

Before calculating probability, it’s helpful to understand the specific meaning of a few words in the context of probability.

  • Experiment or trial: this refers to any action where the outcome is uncertain (e.g. rolling dice, spinning a spinner, flipping a coin, etc.).
  • Sample space: this includes all possible outcomes of an experiment (e.g. 36 possible outcomes from rolling 2 dice).
  • Event: this is one or more outcomes from an experiment (e.g. rolling doubles).

Side note: There are a couple different types of events which can impact how the probability is calculated.

  • Independent - Each event is not affected by any other events (e.g. when flipping a coin, each toss is perfectly isolated).
  • Dependent - Each event can be affected by previous events (e.g. drawing names for a gift exchange - once a name is drawn, the remaining possible names to draw from is reduced).
  • Mutually exclusive - Both events can’t happen at the same time (e.g. turning left or right, flipping a coin, etc.).

The basic formula to calculate the probability of an event is to divide the number of ways the event could happen by the total number of possible outcomes.

# of ways the event can happen / # of total possible outcomes = Probability that an event will occur

Using the example of rolling 2 dice, this is how to calculate the likelihood of rolling doubles.

6 (doubles can be rolled 6 different ways) / 36 (total possible outcomes from rolling 2 dice) = about 16.7% (also shown as 0.167 or ⅙) probability of rolling doubles
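
Here’s a minimal Python sketch that reproduces this calculation by listing every outcome of rolling two dice and counting the ones that are doubles.

  from itertools import product

  # Sample space: all 36 outcomes of rolling two dice.
  outcomes = list(product(range(1, 7), repeat=2))

  # Event: the two dice show the same number (doubles).
  doubles = [roll for roll in outcomes if roll[0] == roll[1]]

  probability = len(doubles) / len(outcomes)
  print(len(doubles), len(outcomes), probability)  # 6 36 0.1666... (about 1 in 6)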

Obviously, probability gets more complicated when calculating the likelihood of conditional or dependent events such as drawing a red card followed by a black card. To learn more about calculating the probability of dependent events, see here.

Probability and the gambler’s fallacy

Sometimes when we’re looking at the probability of future events, past outcomes can play tricks on us. This ‘trick’ is called gambler’s fallacy.

This is also known as the Monte Carlo Fallacy because of an infamous example that occurred at a roulette table in Monte Carlo in 1913. The ball fell on black 26 times in a row and gamblers lost millions betting against black, assuming the streak had to end. However, the chance of black is always the same as red regardless of what’s happened in the past, because the underlying probability is unchanged - a roulette wheel doesn’t have a memory.

Another similar example is assuming a coin that’s landed on heads the past 15 times will land on tails next. However, each toss is independent and the probability remains consistent: 50% for heads and 50% for tails.

Additional resources for learning more about probability

What is qualitative data in statistics?

Qualitative data is defined as information that describes or categorizes something. It answers the broad question of “What qualities does this have?”

Qualitative data cannot be easily measured or counted and therefore often doesn’t contain numbers. For example, you might interview customers to determine which social media platform they use most. You would then categorize the responses by platform such as Facebook, Twitter, Quora, Snapchat, etc.

Or an ecommerce retailer may poll shoppers to see which color - teal, gray, or white - is preferable for a specific item. (Note: if you combine all the results from the poll - e.g. 45 teal, 70 gray, and 52 white, this becomes quantitative data.)

In some instances, a number or code may be assigned to qualitative descriptions or categories. For example, a company may assign numbers 1-5 to a satisfaction survey: Very satisfied (5), Satisfied (4), Somewhat satisfied (3), Somewhat dissatisfied (2), and Dissatisfied (1). (You might be wondering, does this turn it into quantitative data? That’s a great question with a complicated answer. You can learn more here about the debate on this type of data - ordinal data.)

Because qualitative data describes, it’s often subjective and relative such as cheap, expensive, smaller, larger, sweet, sour, highly engaged, disengaged, etc.

It’s worth noting that most people use the term ‘qualitative data’ more loosely in business than the pure statistical definition (above). The more general business use refers to user interviews or research - rich information that cannot be measured.

Types of qualitative data

There are three types of qualitative data: binomial data, nominal data, and ordinal data.

  • Binomial data (or binary data): this divides information into two mutually exclusive groups. Examples of binary data are true/false, right/wrong, accept/reject, etc.
  • Nominal data (or unordered data): this groups information into categories that do not have implicit ranking. Nominal data examples include colors, genres, occupations, geographic location, etc.
  • Ordinal data (or ordered data): as the name implies, information is categorized with an implied order. Examples of ordinal data are small/medium/large, unsatisfied/neutral/satisfied, etc.

What’s the difference between quantitative and qualitative data?

The terms quantitative and qualitative data are often mentioned together, so it’s important to understand the distinction between the two.

Qualitative data is information that describes or categorizes. This involves qualities.

Quantitative data is information that measures or counts. This involves numbers such as monthly revenue, distance of a race and time of the winner, calories in a meal, temperature, or salary. There’s a more complete definition of quantitative data here (including examples).

Additional resources for learning more about qualitative data

Quantitative data definition

Quantitative data is anything that can be measured or counted. This is also called numeric data because it deals with numbers.

There’s a wide range of quantitative data examples in statistics such as monthly revenue, distance of a race and time of the winner, calories in a meal, temperature, salary, etc.

What are the different types of quantitative data?

There are two types of quantitative data: continuous data and discrete data.

  • Continuous data: this is information that can be measured. It refers to one point within a range (or continuum). Technically, continuous data can be infinitely more precise. For example, if you use a scale at home, your dog may weigh 35 pounds. But the veterinarian’s scale might show more precisely that the dog weighs 35 pounds and 7.63 ounces. Other examples of continuous data include the speed of a car, the weight of a toddler, the time a train departs, and the rate of revenue growth.
  • Discrete data: this is information that can be counted. Generally, discrete data contains integers (i.e. finite values) and cannot be more precise. For example, the number of goldfish in an aquarium is discrete since they can be physically counted and it’s impossible to have 3.7 goldfish. Other examples of discrete data include number of customers, number of languages a person speaks, and number of apps on your phone.

What’s the difference between quantitative and qualitative data?

The terms quantitative and qualitative data are often mentioned together, so it’s important to understand the distinction between the two.

Quantitative data is information that measures or counts. This involves numbers.

Qualitative data is information that describes and categorizes. This involves qualities such as the color of the sky, the smell of perfume, music genres, or coffee bean flavors. There’s a more in-depth definition of qualitative data with more detail and examples.

Additional resources for learning about quantitative data

What is regression in statistics?

Regression in statistics is a useful way of making predictions about data. Regression shows the most likely outcome based on a trend of one or more known data points (predictors) - and the impact of changing one of the predictors.

For example, we might predict the most likely grade point average (GPA) students would earn in college based on their annual aptitude scores in high school.

The more technical definition of regression is the strength of relationship between one or more independent data points (variables we can change or predictors) and one dependent data point (the predicted outcome).

Continuing with the example above, regression allows us to estimate how much higher students’ college GPA might be if their aptitude scores were improved by 2 points every year.

Regression example

Let’s look at another regression example. Suppose you want to predict how much money you could make by investing in mutual funds over the next 10 years. The known data points (predictors or variables you can change) in this example would be how much money you contribute, how frequently you contribute, and the past performance of the mutual funds. By adjusting any one of those variables, you can predict how your return on investment may increase or decrease.

Regression equation and regression line

Regression equation and regression line are two important terms to know. A regression line is the trend that emerges when we plot our known data (predictors) on a graph. And the way we plot our data is by using a regression equation - a mathematical formula where we can plug in our known data to calculate the predicted outcome. There are different types of regression equations, but the most common one is the linear regression equation. (Learn more about regression equations here.)

By using a regression equation to visualize our data on a graph, we can more easily see how the outcome might change when one or more predictors change.
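
Here’s a minimal sketch of fitting a linear regression equation in Python, using made-up aptitude scores and college GPAs. numpy’s polyfit with degree 1 returns the slope and intercept of the regression line, which we can then use to predict an outcome for a new predictor value.

  import numpy as np

  # Made-up data for illustration: high school aptitude scores and college GPAs.
  aptitude_scores = [1050, 1120, 1200, 1280, 1350, 1430]
  college_gpas = [2.7, 2.9, 3.1, 3.3, 3.5, 3.7]

  # Fit a straight regression line: gpa = slope * score + intercept
  slope, intercept = np.polyfit(aptitude_scores, college_gpas, 1)

  # Use the regression equation to predict the outcome for a new predictor value.
  new_score = 1300
  predicted_gpa = slope * new_score + intercept
  print(round(predicted_gpa, 2))  # about 3.4 with these made-up numbers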

Types of regression

For different types of data, there are different types of regression. (Defining the data types in the context of regressions exceeds the depth of this explanation, but you can learn more about both types here.)

The types of regression include linear, logistic, ridge, LASSO, polynomial, Bayesian, and others. At a very high level, the difference between regressions is the shape or arc of the regression line. Basically, each regression has a different shape when visualized on a graph (reflecting the varied patterns of the data being visualized).

(It’s worth noting that each type of regression has its own equation.)

What is the regression fallacy?

The regression fallacy, more commonly called regression toward the mean, is when something that’s unusually good or bad reverts back toward the average (i.e. regresses toward the mean). This statistical fallacy occurs anywhere random chance plays a part in the outcome.

For example, success in business is often a combination of both skill and luck. This means that the best-performing companies today are likely to be much closer to average in 10 years time, not through incompetence but because today they’re likely benefitting from a string of good luck – like rolling a double-six repeatedly.

Regression vs. correlation

Correlation shows what, if any, relationship exists between two data points. (Learn more about correlation here.)

Regression goes a step further by modeling how one piece of information (the outcome) changes in response to one or more other data points. Regression also allows us to ‘play’ with the outcome by changing the independent data.

For example, we could see how fluctuating oil costs would impact gasoline prices.

Additional resources to learn more about regressions

What is sampling error?

Sampling error is the variation between the entire population (of data) and the sample. This variation is simply because the sample doesn’t (and can’t) perfectly reflect the whole.

The name can be confusing because ‘error’ is typically understood as ‘mistake.’ However, in data science and statistics, sampling error is defined as the difference between the subset (sample) and the whole.

How to reduce sampling error

The only way to completely eliminate sampling error is to test the entire population. Since this is often not feasible (e.g. polling the entire U.S. population, measuring the efficiency of all flights worldwide, etc.), sampling error can be reduced by enlarging the sample size.

You can also calculate sampling error by using a specific sampling model. If you want to dive even deeper, it may be helpful to understand standard deviation.
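
Here’s a minimal Python sketch that simulates the effect of sample size: it draws random samples of increasing size from a made-up ‘population’ of order values and shows the gap between each sample’s average and the true population average. On the whole, larger samples tend to land closer to the truth.

  import random

  random.seed(42)  # fixed seed so the sketch is repeatable

  # A made-up population: 100,000 order values between $10 and $200.
  population = [random.uniform(10, 200) for _ in range(100_000)]
  true_mean = sum(population) / len(population)

  for sample_size in (10, 100, 1_000, 10_000):
      sample = random.sample(population, sample_size)
      sample_mean = sum(sample) / len(sample)
      error = abs(sample_mean - true_mean)
      print(f"n={sample_size:>6}: sampling error of about ${error:.2f}")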

What is the difference between sampling error and non-sampling error?

The term non-sampling error is more of a catch-all for mistakes that might be made when analyzing data (sampled or whole) or designing/collecting/reporting a sample. Examples of non-sampling errors include bias, inconsistent or missing data, measurement errors, poor sampling or questionnaire design, nonresponse, mistake in recording data, etc.

While sampling error is the inherent variation from the whole, non-sampling error refers to any extrinsic variation or mistake that distorts the perception of the whole.

Additional resources to learn more about sampling error

What is the meaning of summary statistics?

Summary statistics (or summary metrics) define a complicated set of data (or whole population) with some simple metrics. Basically, summary statistics summarize large amounts of data by describing key characteristics such as the average, distribution, potential correlation or dependence, etc.

Examples of summary statistics

Summary statistics usually fall into several broad categories: location, shape, spread, dependence, and order statistics. We’ll look at examples in each one.

Summary statistics that measure:

  • Average (or central tendency) - Where is the data centered? Where is the trend? Examples include mode, median, and mean.
  • Shape - How is the data distributed? What is the pattern? How is the data skewed? Examples include skewness or kurtosis and L-moments.
  • Spread - How varied or dispersed is the data? Examples include range, variance, and standard deviation (among others).
  • Dependence - If the data contains more than one variable, are the variables correlated or dependent? The primary example is correlation coefficient.

Which summary statistics to use

Deciding which summary statistics to use depends on which questions you need answered and/or the problem you’re trying to solve. Often before even looking at the data, it’s useful to articulate exactly what your goal or problem is. Check out this no-nonsense data analysis guide for help defining your problem.

There’s also another set of summary statistics - order statistics - that combines several of the above-mentioned metrics (e.g. average, shape, spread, etc.). The two most common order statistics are the five number summary and slightly expanded seven number summary. As the titles indicate, they include 5 and 7 specific numbers (respectively) that help define the entire data set. The advantage of using either of these order statistics is that you don’t have to decide which numbers (e.g. mode, median, skewness, L-moments, variance, etc.) to include in the summary because they’re already defined.
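
Here’s a minimal Python sketch that computes a five number summary (minimum, lower quartile, median, upper quartile, maximum) for the made-up customer service reply times used in the outlier example earlier.

  from statistics import quantiles

  # Made-up customer service reply times (minutes), reused from the outlier example.
  reply_times = [22, 18, 21, 27, 26, 23, 25, 134, 22, 23]

  q1, median, q3 = quantiles(reply_times, n=4)  # the three quartile cut points

  five_number_summary = {
      "minimum": min(reply_times),
      "lower quartile": q1,
      "median": median,
      "upper quartile": q3,
      "maximum": max(reply_times),
  }
  print(five_number_summary)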

Danger of summary statistics

It’s helpful to visualize summary statistics on a graph because sometimes data sets can have the same summary metrics and yet be visually distinct as illustrated here.

To demonstrate the effect of this data fallacy, statistician Francis Anscombe put together four example data sets in the 1970s. Known as Anscombe’s Quartet, each data set has the same mean, variance, and correlation. However, when graphed, it’s clear that the data sets are totally different. The point Anscombe wanted to make is that the shape of the data is as important as the summary metrics and cannot be ignored in analysis.

Additional resources to learn more about summary statistics

What is a variable?

A variable can mean slightly different things depending on the context. Here’s a quick overview of each contextual definition.

  • A variable in mathematics is a quantity that may change (in the context of a math problem) and is usually shown as a letter such as x or y. In more advanced math, a variable might represent a number, vector, matrix, or function.
  • A variable in computer science (or programming) is like a ‘container’ or ‘bucket’ that holds information. This allows the contents of the container (pieces of information) to be referenced without requiring the name of a specific piece of information.
  • A variable in experiments (research) is anything that varies in quantity or quality. Research variables fall into three categories: independent, dependent, and controlled.
  • A variable in data sets is a property being measured (usually in a column).
  • A variable in statistics is an attribute that describes a person, place, thing, or idea and can vary over time or between data sets.

Types of variables in statistics

For a more detailed definition of variables in statistics, we need to look at the different types since each one has its own distinct meaning. Here are the most common types of variables you might encounter.

(Note: Sometimes variables have several different names, which can be confusing. We’ve only listed the most common names for simplicity).

Independent and dependent variables

An independent variable (sometimes called ‘predictor’ or ‘experimental’ variable) is the input of an experiment that can be manipulated to affect the dependent variable (sometimes called ‘outcome,’ ‘predicted,’ or ‘response’ variable). Independent variables can be controlled, dependent variables cannot be controlled.

For example, the size (i.e. diameter) of a garden hose would be an independent variable that affects the amount of water (dependent variable) that’s able to come out. By changing the size of the garden hose, we can increase or decrease the water flow.

Independent and dependent variables can be quantitative or qualitative.

Quantitative (or numeric) variables

  • Discrete - this type of variable is a finite number (i.e. it can be counted). Generally, discrete variables contain integers (whole numbers - not decimals or fractions) and cannot be more precise. For example, the number of pets a family has is a discrete variable - it’s impossible to have 2.5 dogs or 1.5 cats.
  • Continuous - this is the opposite of discrete since it represents an infinite number. It can refer to one point within a range (or continuum). Technically, continuous variables can be infinitely more precise. For example, the weight of a dog can always be more precise with a more precise scale.

Qualitative (or categorical) variables

  • Ordinal (or ranked variable) - these are descriptive variables that have an implied order or rank. Examples of ordinal variables are small/medium/large, unsatisfied/neutral/satisfied, etc.
  • Nominal - this is sometimes just called ‘categorical variable’ and refers to descriptive variables that do not have an implicit ranking. Nominal variable examples include colors, genres, occupations, geographic location, etc.

Number of variables: univariate vs. bivariate

When the data you’re analyzing has just one variable, the data set is called univariate data. If you’re analyzing the relationship between two variables, the data set is called bivariate data.

For example, the height of a group of people would be univariate data since there’s only one variable - height. But if we were to look at height AND weight, we’d be working with bivariate data since there are two variables.

Additional resources to learn more about variables

What is a Venn diagram?

A Venn diagram shows the similarities and differences of two or more data sets by using overlapping circles. (In the context of Venn diagrams, a ‘set’ is simply a collection of objects.) The overlapping areas show the similarities and the non-overlapping areas show the differences. Venn diagrams are particularly useful for showing the logical relationship between data sets.

For example, you could compare electric cars and gasoline-powered cars - a circle for each one. The two circles would overlap in the middle showing the number of hybrid cars that can operate on both gasoline AND an electric charge.

Although Venn diagrams can have unlimited circles (each circle representing a data set), they usually have just two or three overlapping circles (any more than three circles/data sets becomes quite complex).

Venn diagrams may also be called primary diagrams, set diagrams, or logic diagrams. It’s worth noting that Venn diagrams are not always quantitative - sometimes they’re purely illustrative for showing intersections between groups.

What is an example of a Venn diagram?

Here’s a straightforward example of a Venn diagram. Suppose you want to see the relationship between books available in hardcopy and books available on Kindle. There are a total of 45 books - 18 available only in hard copy, 15 available only on Kindle, and 12 available in both formats.

By drawing two circles that overlap, you can see the relationship between the two sets of data. The similarity between the two sets is shown in the middle overlapping section.
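
Even without drawing the circles, you can get the same logical breakdown with Python sets. The book titles below are invented placeholders standing in for the example above.

  # Invented placeholder titles; in the example above there would be 18 hardcopy-only,
  # 15 Kindle-only, and 12 books available in both formats.
  hardcopy = {"Book A", "Book B", "Book C", "Book D"}
  kindle = {"Book C", "Book D", "Book E"}

  both = hardcopy & kindle            # the overlapping section of the Venn diagram
  hardcopy_only = hardcopy - kindle   # the left circle only
  kindle_only = kindle - hardcopy     # the right circle only

  print(sorted(both))           # ['Book C', 'Book D']
  print(sorted(hardcopy_only))  # ['Book A', 'Book B']
  print(sorted(kindle_only))    # ['Book E']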

Venn diagrams can become much more complex with more data sets (creating additional circles) and are often shaded to help better visualize the relationships between data sets.

Additional resources to learn more about Venn diagrams