Data Outliers – What Are They And How Do You Find Them?

16.04.23 Statistical data Time to read: 9min


Detecting anomalies is a crucial aspect of any company’s operations, achievable through various methods, including outlier analysis. In the realm of statistics, outliers can dramatically skew results, particularly when dealing with small sample sizes, thereby affecting averages and influencing the conclusions drawn. Understanding and managing these outliers is an essential step in data analysis.

Data Outliers – In a Nutshell

  • Data outliers are values that deviate substantially from the rest of the dataset.
  • Data outliers can broadly be categorized into three types: global, contextual, and collective.
  • Data outliers resulting from natural variation in a population are referred to as true outliers, and data outliers arising from measurement errors or incorrect data entry are false outliers.
  • Methods for finding data outliers include sorting data, graphing data, calculating z-scores, using the interquartile range, the Tukey method, and hypothesis tests.
  • Removing data outliers from a dataset can improve the accuracy of statistical analysis and prevent misleading results. However, it can also remove valuable information from the dataset.

Definition: Data outliers

Data outliers are values that differ significantly from the rest of the dataset. They represent information on out-of-the-ordinary behavior, such as numbers far from the norm for the variable in question.

Depending on the overall context of the data, outliers can signify either an error in data collection or measurement, or an interesting anomaly such as a rare occurrence or an extreme value.

According to Roshandel, M. Reza, 2022, data outliers are problematic in statistics because they:

  • Affect the accuracy of estimates and even cause bias
  • Skew the average if they aren’t distributed randomly
  • Affect the validity of statistical methods such as regression
  • Make statistical tests less predictive when error variance increases

Example

In a dataset of test scores for a class, an outlier could be an extreme 98% score when the rest of the scores range from 60% to 80%. This outlier could be due to various factors, such as cheating or a mistake in grading. If this outlier is not identified and dealt with, it could skew the overall average score for the class, leading to incorrect conclusions or decisions based on the data.

Different types of data outliers

There are various kinds of data outliers, but broadly speaking, they can be categorized into three types:

  1. Global
  2. Contextual
  3. Collective

Global outliers are the most basic kind of data outlier. They are data points that differ significantly from all other points in the dataset.

These types of data outliers can occur due to measurement errors or represent extreme values in the population. Existing outlier detection methods often focus on finding global outliers.

Contextual outliers are data points that do not look unusual in isolation but become outliers once contextual factors are taken into account.

Example

A high temperature reading on a hot summer day may not be an outlier on its own, but it would be considered an outlier on a cold winter day.

Without relevant background information, it can be challenging to identify contextual outliers in a given set of data. For this reason, it’s essential to have a description of the context at hand while looking for them.
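The temperature example can be sketched in Python. This is a minimal illustration, not a definitive method: the readings are invented, the helper name tukey_outliers is hypothetical, and the check borrows the 1.5 x IQR rule described later in this article. It shows that the same reading can pass a global check yet stand out within its own context (here, the season).

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily temperatures in degrees Celsius, invented for illustration.
readings = {
    "winter": [2, 4, 1, 3, 30, 2, 5],
    "summer": [28, 31, 29, 30, 32],
}

# Checked globally, 30 is unremarkable because summer days are just as warm.
all_values = [t for temps in readings.values() for t in temps]
global_outliers = tukey_outliers(all_values)

# Checked within its own season, the same reading stands out in winter only.
contextual_outliers = {
    season: tukey_outliers(temps) for season, temps in readings.items()
}

print(global_outliers)      # []
print(contextual_outliers)  # {'winter': [30], 'summer': []}
```

Grouping by season is only one possible choice of context; in practice the grouping variable depends on what background information is available.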

Collective outliers are groups of data points that differ significantly from the other groups in the dataset. They can occur when there are subgroups or clusters of data with different characteristics, or errors in the grouping or categorization of data.

True and false data outliers

Data outliers can occur due to different reasons, and it’s important to distinguish between true outliers resulting from natural variation in a population and outliers arising from measurement errors or incorrect data entry.

Outliers resulting from natural variation in the population are referred to as true/real outliers. These data outliers can provide valuable insights into the data distribution and the population’s characteristics under study.

Example

If you have a dataset on the heights of people in a population, a true outlier could be the height of an exceptionally tall person, which would reflect the natural variation in the height of the population.

In contrast, data outliers arising from measurement errors or incorrect data entry are known as false/spurious outliers. These data outliers can distort the distribution of the data and can result in incorrect conclusions or analyses.

Example

If you have a dataset on the weights of individuals, an extreme recorded weight due to a measurement or data entry error is a spurious outlier.

When dealing with many outliers or a skewed distribution, it’s essential to consider what caused them.

If the outliers are due to measurement errors or incorrect data entry, they should be corrected or removed from the dataset.

However, if the outliers are due to natural variation in the population, they should be retained in the dataset.

Finding data outliers

Detecting data outliers is essential in data analysis as it ensures data quality, statistical significance, and accurate modeling. There are several techniques for finding data outliers, each of which takes a slightly different approach.

Sorting your data

Sorting your data is an easy way to identify outliers because it allows you to see the extreme values at a glance. You can sort your data in ascending or descending order, depending on the variable of interest, and then check for abnormally high or low values.

Example

If you are analyzing a dataset of exam scores: 78, 85, 92, 67, 88, 99, 76, 25, 84, 92, sorting the data would look like 25, 67, 76, 78, 84, 85, 88, 92, 92, 99.

In this case, you can easily spot that the score of 25 is an outlier.
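As a rough sketch, the same check takes only a few lines of Python. The variable names are illustrative, and looking at the gaps between neighbouring values is an added convenience rather than part of the method described above.

```python
scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]

# sorted() returns a new list in ascending order; the original is unchanged.
ordered = sorted(scores)

# Gaps between neighbouring values: an unusually large gap at either end
# of the sorted list hints at an outlier.
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print(ordered)  # [25, 67, 76, 78, 84, 85, 88, 92, 92, 99]
print(gaps[0])  # 42, the jump from 25 to 67
```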

Graphing your data

Graphs like scatter plots and histograms can also be used to identify data outliers. Graphs provide a visual representation of your data, making it simple to highlight patterns and spot outliers.

Outliers will manifest as points or bars that display a notable deviation from the rest of the dataset. A scatter plot displays data points as dots on an x-y graph, with each dot positioned according to the values of two variables.

Since most points tend to cluster together, scatter plots make it simple to spot outliers: a point that sits far away from the cluster is the outlier.

In contrast, histograms use bars to organize data into ranges (bins). The data ranges are shown along the x-axis, while the number of data points in each range is shown along the y-axis. This helps in locating outliers in the data. For example, if most data points fall into the right-hand side of the graph, but one of the bins sits on the extreme left, that left bin stands out as an anomaly.
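To illustrate the histogram idea without a plotting library, the sketch below prints a simple text histogram of the exam scores from the sorting example. The bin width of 10 is an assumption chosen for readability.

```python
from collections import Counter

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]
bin_width = 10  # assumed bin size

# Map each score to the lower edge of its bin, e.g. 78 -> 70.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin, including empty bins, so that gaps stay visible.
for edge in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{edge:>3}-{edge + bin_width - 1:<3} {'#' * counts[edge]}")
```

The single bar in the 20-29 bin, separated from the rest by several empty bins, is the isolated left-hand bin described above.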

Calculating z-scores

A z-score represents the number of standard deviations by which a data point differs from the mean of a dataset. Z-scores are obtained by taking a data point, subtracting the mean, and then dividing by the standard deviation.

The formula for calculating a z-score is:

z = (x – μ) / σ

Where:

  • z = z-score
  • x = data point
  • μ = mean of the dataset
  • σ = standard deviation of the dataset

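The further a z-score is from 0, the more unusual the data point. A common rule of thumb treats a data point as an outlier when its z-score is greater than 3 or less than -3, although the exact cutoff is a judgment call.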

Example

If your data points have z-scores of -0.32, -0.15, -5.2, -0.29, and -0.19, respectively, the one with a z-score of -5.2 stands out as the outlier.
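The formula can be applied with Python's standard library. This is a minimal sketch under two assumptions: it uses the population standard deviation (pstdev), and it uses a cutoff of 2.5, which is an illustrative choice for this small sample rather than a fixed rule.

```python
from statistics import mean, pstdev

def z_scores(data):
    """Return the z-score of every value, using the population standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)
    return [(x - mu) / sigma for x in data]

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]
threshold = 2.5  # assumed cutoff; choose one that suits your data

# Keep the values whose z-score is further than the threshold from 0.
flagged = [x for x, z in zip(scores, z_scores(scores)) if abs(z) > threshold]

print(flagged)  # [25]
```

The score of 25 has a z-score of roughly -2.7. With only ten values, no z-score can exceed 3 in absolute value, so the more common cutoff of 3 is better suited to larger datasets.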

Using interquartile range

The interquartile range (IQR) is a measure of variability that tells how spread out the middle 50% of the dataset is. To identify data outliers using the IQR method, a general rule is to consider any data point that falls more than 1.5 times the IQR below the first quartile or above the third quartile as an outlier.

Q1 and Q3 denote the first and third quartiles of a dataset, and the IQR is the difference between them: IQR = Q3 – Q1.
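As a sketch, the quartiles and the IQR of the exam scores from the sorting example can be computed with Python's statistics.quantiles. Note that quartile conventions differ between tools; the default "exclusive" method is assumed here, so other software may report slightly different values.

```python
from statistics import quantiles

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, median, q3 = quantiles(scores, n=4)
iqr = q3 - q1

print(q1, q3, iqr)  # 73.75 92.0 18.25
```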

Tukey method

The Tukey method, also known as the Tukey fence method, is a variation of the IQR method that uses fences to identify data outliers.

A fence is a threshold value beyond which any data point is considered an outlier. The fences are calculated by adding or subtracting a multiple of the IQR from the first and third quartiles. Any data point that falls outside of the fences is considered an outlier.

The fences are calculated as follows:

  • Lower fence = Q1 – (1.5 x IQR)
  • Upper fence = Q3 + (1.5 x IQR)
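The fences can be sketched in Python as follows. The function name tukey_fences is hypothetical, and the quartiles come from statistics.quantiles with its default method, which is one of several conventions.

```python
from statistics import quantiles

def tukey_fences(data, k=1.5):
    """Return (lower fence, upper fence) for the given multiplier k."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]
lower, upper = tukey_fences(scores)

# Any value outside the fences is flagged as an outlier.
outliers = [x for x in scores if x < lower or x > upper]

print(lower, upper)  # 46.375 119.375
print(outliers)      # [25]
```

The multiplier k is left as a parameter so that a stricter or looser fence can be tried without changing the function.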

Hypothesis tests

Hypothesis tests such as Grubbs’ test and Peirce’s criterion are more advanced methods for identifying data outliers. They are typically used when the data is suspected to be from a normally distributed population.

These tests are useful when the dataset is large and the outliers are challenging to detect using simple methods like sorting or graphing the data. It’s crucial to exercise caution when using these tests because they presume that the data is normally distributed, and they may not be applicable to datasets that have other types of distributions.
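As one illustration, the sketch below applies a two-sided Grubbs' test to the exam scores from the sorting example, using only the standard library. The function names are hypothetical. The t-distribution quantile 3.833 is an assumption taken from a printed t-table (significance level 0.05, 10 values, 8 degrees of freedom); it should be looked up again for any other sample size or significance level.

```python
from math import sqrt
from statistics import mean, stdev

def grubbs_statistic(data):
    """Return the most extreme value and its Grubbs statistic G."""
    mu = mean(data)
    s = stdev(data)  # sample standard deviation
    suspect = max(data, key=lambda x: abs(x - mu))
    return suspect, abs(suspect - mu) / s

def grubbs_critical(n, t_crit):
    """Critical value of G for a two-sided test.

    t_crit is the upper alpha / (2 * n) quantile of Student's t
    distribution with n - 2 degrees of freedom.
    """
    return (n - 1) / sqrt(n) * sqrt(t_crit**2 / (n - 2 + t_crit**2))

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]
suspect, g = grubbs_statistic(scores)

# 3.833: tabulated t quantile for alpha = 0.05 and n = 10 (assumed value).
g_crit = grubbs_critical(len(scores), t_crit=3.833)

# The suspect value is declared an outlier if G exceeds the critical value.
is_outlier = g > g_crit

print(suspect, round(g, 2), round(g_crit, 2), is_outlier)
```

Here G is about 2.56 against a critical value of about 2.29, so the score of 25 would be flagged. Grubbs' test checks one suspect value at a time and relies on the normality assumption mentioned above.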

Removing data outliers

Detecting and understanding data outliers is essential in statistics for several reasons:

  • They can significantly impact statistical analyses, as they can skew the results and make them less accurate.
  • Data outliers can provide important information about the data-generating process, such as errors in measurement, data entry, or data collection.
  • They may represent rare or extreme events of particular interest, such as anomalies in financial data.

It’s important to note that removing data outliers can also remove valuable information from the dataset. In some cases, data outliers represent values important to the analysis.

Example

Outlier data points in finance can represent high-risk investments or exceptional returns. Therefore, it’s essential to carefully consider whether to remove outliers from a dataset.

When to and not to remove a data outlier

Removing data outliers from a dataset can improve the accuracy of statistical analysis and prevent misleading results. However, whether to eliminate an outlier depends on whether it’s representative of your study’s population, topic, research question, and approach.

If the outlier is:

  • A data entry or measurement error, you can correct it. However, if the error cannot be fixed, removing it is the best solution.
  • Not representative of the population under study (i.e., having exceptional traits or situations), it can be legitimately excluded.
  • A natural part of the population you’re studying, you shouldn’t get rid of it.

How do you remove outliers from your dataset?

Some of the common methods for removing data outliers are:

  • Using standard deviation:
    This method is suited to normally distributed (Gaussian) data. The upper and lower boundaries are set at 3 standard deviations above and below the mean, and values outside these boundaries are removed.
  • Interquartile range (IQR) method:
    It involves calculating the IQR of a dataset and removing any data point that lies more than 1.5 x IQR below the first quartile or above the third quartile.
  • Visual inspection:
    This method involves visually inspecting a plot of the data and identifying any points that are far from the other points.

It’s important to note that these methods may not always be appropriate. Therefore, you should consider the context of the analysis and the study’s goals before retaining or removing data outliers.
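To show the effect of removal, the sketch below filters the exam scores from the sorting example with the IQR method and compares the mean before and after. The function name remove_outliers_iqr is hypothetical, and whether removal is appropriate at all still depends on the considerations above.

```python
from statistics import mean, quantiles

def remove_outliers_iqr(data, k=1.5):
    """Return only the values that lie within k * IQR of the quartiles."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if low <= x <= high]

scores = [78, 85, 92, 67, 88, 99, 76, 25, 84, 92]
cleaned = remove_outliers_iqr(scores)

print(mean(scores))              # 78.6
print(round(mean(cleaned), 2))   # 84.56
```

Dropping the single score of 25 raises the class average by almost six points, which illustrates how strongly one value can influence the mean in a small sample.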


FAQs

Data outliers are not always bad, but they can indicate errors or unusual patterns in the data that may warrant further investigation.

Preventing outliers in data can be challenging, but some measures can be taken, such as:

  • Improving data quality
  • Using appropriate measurement techniques
  • Setting realistic thresholds for data values

For skewed datasets, the IQR method is generally the better choice for removing outliers. However, if the data is approximately normally distributed, the standard deviation method is usually preferred.