Data Cleansing – What It Is And How To Use It

03.02.23 Collecting data Time to read: 8min

Data cleansing, also known as data cleaning or scrubbing, is the process of identifying errors, inconsistencies, and inaccuracies in data and correcting or removing them to enhance data quality. The errors can range from simple issues like misspellings to more complicated ones like missing data or mislabeled categories. Accurate data cleansing improves the reliability of the ensuing data analysis and, consequently, the validity of insights or decisions derived from it. Despite being time-consuming, it’s a necessary measure in research methodology to ensure the data used is as accurate as possible.

Data Cleansing – In a Nutshell

  • Without proper data cleansing, errors and inconsistencies may lead to bias and false conclusions.
  • The aim of data cleansing is to improve the data quality for analysis and decision-making.
  • Data cleansing is a crucial step in data preparation before any statistical analysis or hypothesis testing.

Definition: Data cleansing

Data cleansing refers to removing or correcting inaccurate, incomplete, or irrelevant data to improve quality and consistency. Its purpose is to resolve errors in the data to make it accurate, complete, and valid. An error in data can be defined as any deviation from the expected values or patterns. Here is a step-by-step process for data cleansing:

  • Data validation
  • Data screening
  • Diagnosing data entries
  • Developing codes
  • Transforming/removing data

The importance of data cleansing

Data plays a crucial role in quantitative research, as it is the basis for inferences and predictions about a given population. Statistical analyses are used to analyze the data, and hypothesis testing is used to test the validity of research findings. However, if data is not cleansed properly, it can lead to bias in research results, such as information bias or omitted variable bias.

Example

If a study on the effectiveness of a new medication includes only patients with a particular medical condition, the results may not be generalizable to the larger population.

Data cleansing – Distinguish dirty from clean data

Dirty data is data that contains inconsistencies and errors. Three common sources of dirty data include:

  • Poor research design
  • Data entry errors
  • Inconsistent formatting

Dirty data        | Clean data
Invalid           | Valid
Inaccurate        | Accurate
Incomplete        | Complete
Inconsistent      | Consistent
Duplicate         | Unique
Falsely formatted | Uniform

Valid vs. invalid data

Valid data meets the criteria for data validation, such as being within a specific range. Invalid data doesn’t meet these criteria and may be removed or corrected during data cleansing.

Example

Ensuring that all participants in a study are within the specified age limit.

Accurate vs. inaccurate data

Accurate data reflects the true values being measured, while inaccurate data contains errors that make the recorded value differ from the true one.

Example

A participant’s age is recorded as 25 when they are 35.

Complete vs. incomplete data

Complete data is fully recorded and contains no missing values, while incomplete data contains missing values. Incomplete data can be reconstructed using methods such as single imputation or multiple imputation.

Example

A survey missing the responses for certain questions.

Consistent vs. inconsistent data

Consistent data agrees with other data and doesn’t contain any contradictions. In contrast, inconsistent data contains contradictions or discrepancies.

Example

A participant’s height is recorded as 6 feet in one survey and 6’1″ in another.

Unique vs. duplicate data

Unique data is distinct and not duplicated, while duplicate data is identical to other data.

Example

Having two records for the same participant in a study. Eliminating duplicate data through data cleansing is necessary to prevent inaccuracies in the analysis.

Uniform vs. falsely formatted data

Uniform data follows a consistent format and structure, while falsely formatted data deviates from the established format.

Example

A participant’s phone number being recorded in different formats in different surveys.

Data cleansing – How to do it

Effective data cleansing is crucial for accurate and reliable quantitative research. It’s important to consider the potential hurdles that may occur during data cleansing, like missing values, outliers, or incorrect formatting. To effectively cleanse data, various techniques can be used, such as data validation, data screening, data diagnosis, code development, and data transformation/removal.

Data cleansing workflow

A data cleansing workflow is a structured approach to identifying and correcting errors, inconsistencies, and inaccuracies in data. Documenting a data cleansing workflow helps to ensure consistency and reproducibility of results. The various steps of a data cleansing workflow include:

  • Data validation to avoid dirty data: Check data for errors/inconsistencies and remove or correct invalid data.
  • Data screening for errors: Identify data inconsistencies, like missing values or outliers.
  • Diagnosing data entries: Examine individual data entries to identify and correct errors/inconsistencies.
  • Developing codes: Create codes or rules for cleaning and transforming data.
  • Transforming or removing data: Cleanse and transform data for more accurate and reliable analysis.

Data cleansing – Validation

Data validation is a technique to ensure that data meets specific criteria before storing or processing. This can include checking for errors and inconsistencies and removing or correcting invalid data. Data validation is relevant when collecting data to ensure that it’s accurate and reliable for analysis. There are several types of data validation constraints, including:

Data-type constraints

Ensure that data is of a particular type, such as a number or a string. This can include checking if a phone number or date is entered in the correct format.

Example

Ensuring that every participant’s age is entered as a whole number rather than as text.
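A data-type constraint can be checked in code by trying to convert each raw entry to the expected type. The following is a minimal sketch in Python, not a definitive implementation; the function name `parse_age` and the sample entries are hypothetical and only illustrate the idea.

```python
def parse_age(raw):
    """Return the entry as an int, or None if it is not a whole number."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None

# Hypothetical raw survey entries for an "age" field.
raw_ages = ["34", " 27 ", "thirty", "41.5", ""]

parsed = [parse_age(value) for value in raw_ages]

# Entries that violate the data-type constraint and need review.
invalid_entries = [raw for raw, age in zip(raw_ages, parsed) if age is None]
print(invalid_entries)
```

Entries that fail the conversion are collected rather than silently dropped, so they can be corrected or removed later in the workflow.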

Range constraints

Ensure that data falls within a specific range. This can include checking that a participant’s age is between 18 and 65 or their weight is between 50 and 200 pounds.

Example

Ensuring that all participants in a study have a BMI within a healthy range.
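A range constraint can be expressed as a simple comparison against a lower and an upper bound. The sketch below uses the age range of 18 to 65 mentioned above; treating both bounds as inclusive, and the participant IDs and ages, are assumptions made for illustration.

```python
AGE_MIN, AGE_MAX = 18, 65  # range taken from the text; inclusive bounds assumed

def in_range(value, low, high):
    """True if low <= value <= high (both bounds inclusive)."""
    return low <= value <= high

# Hypothetical participant IDs mapped to recorded ages.
ages = {"P01": 34, "P02": 17, "P03": 65, "P04": 250}

# Collect the records that violate the range constraint.
out_of_range = {pid: age for pid, age in ages.items()
                if not in_range(age, AGE_MIN, AGE_MAX)}
print(out_of_range)
```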

Mandatory constraints

Ensure that certain data is present before it is stored or processed. This can include checking that a required field is not empty or that a certain number of responses are collected.

Example

Ensuring that all participants in a study have provided their names and contact information.
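A mandatory constraint amounts to checking that required fields are present and not blank. This is a minimal sketch; the field names in `REQUIRED_FIELDS` and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("name", "email")  # hypothetical mandatory fields

def missing_fields(record, required=REQUIRED_FIELDS):
    """List required fields that are absent, None, or blank."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing

records = [
    {"id": 1, "name": "Ana Ruiz", "email": "ana@example.com"},
    {"id": 2, "name": "", "email": "ben@example.com"},
    {"id": 3, "name": "Chen Li"},
]

# Map each incomplete record to the fields it is missing.
incomplete = {r["id"]: missing_fields(r) for r in records if missing_fields(r)}
print(incomplete)
```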

Data cleansing – Screening

Storing a copy of the original dataset is vital for data screening, as it lets you compare the original data with the cleaned data. The process of data screening involves:

Step 1: Structuring the dataset

This involves organizing and formatting data to make it more accurate and reliable for analysis. Important steps to consider when structuring a dataset include:

  • Sorting data
  • Removing duplicates
  • Standardizing formatting
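The three structuring steps above can be sketched in a few lines of Python. The rows, the `city` field, and the choice of title case as the target format are assumptions for illustration only.

```python
# Hypothetical rows with inconsistent formatting and one duplicate.
rows = [
    {"id": 3, "city": " berlin "},
    {"id": 1, "city": "Munich"},
    {"id": 3, "city": "Berlin"},
    {"id": 2, "city": "HAMBURG"},
]

def standardize(row):
    """Trim whitespace and use title case for the city field."""
    return {"id": row["id"], "city": row["city"].strip().title()}

# Standardize first so that formatting differences do not hide duplicates.
standardized = [standardize(r) for r in rows]

# Remove exact duplicates while keeping the first occurrence.
seen = set()
unique_rows = []
for row in standardized:
    key = (row["id"], row["city"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# Sort by id so the dataset is easy to scan.
structured = sorted(unique_rows, key=lambda r: r["id"])
print(structured)
```

The order of the steps matters in this sketch: standardizing before deduplicating lets " berlin " and "Berlin" be recognized as the same entry.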

Step 2: Scanning data for inconsistencies

The second step in data screening involves identifying any errors or inconsistencies in the data, such as missing values or outliers. Tasks to consider when scanning data for inconsistencies include:

  • Looking for missing data
  • Identifying outliers
  • Checking for patterns in the data
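Looking for missing data can start with a simple count of missing values per field. The sketch below assumes that both `None` and an empty string stand for a missing answer; the survey fields and values are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses; None and "" both stand for a missing answer.
responses = [
    {"age": 34, "income": 52000, "city": "Berlin"},
    {"age": None, "income": 48000, "city": ""},
    {"age": 29, "income": None, "city": "Munich"},
    {"age": None, "income": 51000, "city": "Hamburg"},
]

def is_missing(value):
    return value is None or value == ""

# Count missing values per field.
missing_counts = Counter()
for row in responses:
    for field, value in row.items():
        if is_missing(value):
            missing_counts[field] += 1

# Share of missing values per field, to judge how serious each gap is.
missing_share = {field: missing_counts[field] / len(responses)
                 for field in responses[0]}
print(missing_share)
```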

Step 3: Using statistical methods to explore data

Descriptive statistics and simple plots are crucial for examining distributions and detecting outliers and skewness in data. These methods include:

  • Boxplots, scatterplots, histograms: visualize data and reveal patterns and outliers.
  • Normal distribution: a statistical model against which abnormal data points can be identified.
  • Mean, median, mode: summarize data and hint at skewness and outliers.
  • Frequency tables: show the most common values in a dataset as well as rare, possibly erroneous ones.
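As a rough illustration of the summary statistics and frequency tables listed above, the following sketch uses Python's standard `statistics` and `collections` modules. The ages and answers are invented, and the value 350 is assumed to be a typing error.

```python
import statistics
from collections import Counter

# Hypothetical recorded ages; 350 is most likely a typing error for 35.
ages = [23, 25, 25, 27, 29, 31, 350]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# A mean far above the median hints at skewness or an extreme value.
print(mean_age, median_age, mode_age)

# Frequency table for a categorical field: rare spellings stand out.
answers = ["yes", "no", "yes", "Yes", "no", "yes", "y"]
frequency_table = Counter(answers).most_common()
print(frequency_table)
```

Comparing the mean with the median is a quick first check: a single extreme value pulls the mean strongly while leaving the median almost unchanged.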

Data cleansing – Diagnosing

Diagnosing data is the process of assessing the data quality in a dataset. This step is crucial for understanding potential issues that may arise when working with the data, such as inaccuracies, inconsistencies, and missing values. If data isn’t properly diagnosed, it can lead to inaccurate conclusions and poor decision-making. Some common problems in dirty data include:

  • Duplicate data: Data that appears multiple times in a dataset
  • Invalid data: Data that doesn’t conform to the expected format or values
  • Missing values: Data that is missing in particular fields or observations
  • Outliers: Data that is significantly different from the majority of the data in the dataset

Removing duplicate data

Deduplication is the process of identifying and removing duplicate data from a dataset.

Example

Using a unique identifier, such as a primary key, to identify and delete duplicate rows in a dataset.
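Deduplication by a unique identifier can be sketched as follows. The field name `participant_id` and the rule that the first occurrence is kept are assumptions; in practice the choice of which record to keep depends on the study.

```python
# Hypothetical participant records; "participant_id" acts as the primary key.
records = [
    {"participant_id": "P01", "age": 34},
    {"participant_id": "P02", "age": 27},
    {"participant_id": "P01", "age": 34},
    {"participant_id": "P03", "age": 45},
]

deduplicated = {}
duplicates = []
for record in records:
    key = record["participant_id"]
    if key in deduplicated:
        duplicates.append(record)   # set aside for the cleaning log
    else:
        deduplicated[key] = record  # first occurrence is kept

clean_records = list(deduplicated.values())
print(len(clean_records), len(duplicates))
```

Keeping the removed rows in a separate list makes it possible to document what was deleted instead of losing it.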

Invalid data

Data standardization ensures that data conforms to a specific format or set of rules. This method helps ensure consistency and accuracy in the data.

Example

A phone number field that includes letters or symbols.
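One way to standardize a phone number field is to strip formatting symbols and reject entries that contain letters. This is a sketch under assumptions: the target format (digits only), the function name `standardize_phone`, and the sample numbers are all hypothetical.

```python
import re

def standardize_phone(raw):
    """Return digits only, or None if the entry is empty or contains letters."""
    if re.search(r"[A-Za-z]", raw):
        return None  # letters cannot be repaired automatically
    digits = re.sub(r"\D", "", raw)  # drop spaces, dots, dashes, brackets
    return digits or None

# The third entry contains the letter "O" instead of a zero.
raw_numbers = ["(555) 010-1234", "555.010.1234", "555-01O-1234", ""]
standardized = [standardize_phone(n) for n in raw_numbers]
print(standardized)
```

Entries returned as `None` are invalid under this rule and would need manual correction or removal.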

Strict string-matching and fuzzy string-matching are methods used to identify and correct invalid data. Strict string-matching compares data precisely as entered, while fuzzy string-matching allows for slight variations in the data.

Example

If the invalid data is a list of customer names and addresses, strict string-matching would only match “John Smith” to “John Smith,” while fuzzy string-matching would match “John Smith” to “Jhon Smit.” After matching, the next step is to correct or remove the invalid data.
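The difference between the two approaches can be sketched with Python's standard `difflib` module. Its similarity ratio is only one possible fuzzy-matching measure, and the threshold of 0.8 is an assumed value that would need tuning for real data.

```python
from difflib import SequenceMatcher

def strict_match(a, b):
    """True only if both strings are exactly identical."""
    return a == b

def fuzzy_match(a, b, threshold=0.8):
    """True if the similarity ratio of the two strings reaches the threshold."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

print(strict_match("John Smith", "Jhon Smit"))  # exact comparison
print(fuzzy_match("John Smith", "Jhon Smit"))   # tolerant comparison
```

A lower threshold catches more misspellings but also raises the risk of merging names that belong to different people, so matches are best reviewed before records are corrected.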

Data cleansing – Missing data

Missing data can be random, meaning that it occurs entirely by chance, or non-random, meaning that whether a value is missing is related to characteristics of the data. Missing data can be tackled by:

  • Accepting: Leaving the missing data as is and treating it as a separate category
  • Removing: Deleting observations or fields with missing data
  • Recreating: Using statistical methods to estimate missing data

Example

Removing all observations in a dataset with missing values for a specific field.

You can, however, use imputation to replace missing data with estimated values. To use imputation properly, it’s important to understand the underlying causes of the missing data and to use appropriate statistical methods to estimate the missing values.
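The removing and recreating options can be compared in a short sketch. Median imputation is used here only as one simple example of single imputation; the income values are hypothetical.

```python
import statistics

# Hypothetical incomes; None marks a missing value.
incomes = [42000, None, 51000, 39000, None, 48000]

# Option 1 (removing): drop observations with a missing value.
observed = [x for x in incomes if x is not None]

# Option 2 (recreating): single imputation with the median of observed values.
median_income = statistics.median(observed)
imputed = [median_income if x is None else x for x in incomes]

print(len(observed), median_income)
```

Filling every gap with the same value keeps the sample size but reduces the variability of the data, which is one reason the cause of the missing data should be understood before a method is chosen.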

Data cleansing – Outliers

Outliers in a dataset are values significantly different from most data. Outliers can be either true values or errors.

True outliers | Error outliers
Genuine values that are unusual or unexpected | Values that are the result of errors or mistakes in data collection or entry

Identifying outliers

Common methods to detect outliers in a dataset include:

  • Using statistical tests such as Z-scores or the interquartile range
  • Using visualization methods like box plots or scatter plots
  • Comparing data to expected values or ranges.
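The Z-score and interquartile range checks can be sketched with Python's standard `statistics` module. The data is hypothetical, and the cutoffs (2 standard deviations, 1.5 times the IQR) are common conventions assumed here rather than fixed rules.

```python
import statistics

values = [23, 25, 25, 27, 29, 31, 350]  # hypothetical ages

# Z-score rule: flag values more than z_limit standard deviations from the mean.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_limit = 2  # assumed cutoff; 2 and 3 are both common choices
z_outliers = [v for v in values if abs(v - mean) / stdev > z_limit]

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
iqr_outliers = [v for v in values
                if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(z_outliers, iqr_outliers)
```

In small samples an extreme value inflates the standard deviation it is measured against, so the Z-score rule can miss it at a stricter cutoff; the IQR rule is less affected by this.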

Retaining or removing outliers

There are several ways to handle outliers once they are identified in a dataset. One option is to remove them; another is to keep them but rescale the data so that they carry less weight. Sometimes, it may be best to keep the outliers unchanged and use them to inform the analysis. In every case, it is important to document any outliers found and the decision made about handling them.
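The two handling options can be sketched as follows. Reading "scaling differently" as a log transform is an assumption made for this example, as are the values and the set of flagged outliers.

```python
import math

values = [23, 25, 25, 27, 29, 31, 350]  # hypothetical ages
flagged = {350}  # assumed to have been flagged in an earlier detection step

cleaning_log = []

# Option 1: remove flagged values and record the decision.
without_outliers = [v for v in values if v not in flagged]
removed_count = len(values) - len(without_outliers)
cleaning_log.append(f"Removed {removed_count} flagged value(s): {sorted(flagged)}")

# Option 2: keep every value but analyze it on a log scale,
# which compresses the distance between typical and extreme values.
log_scaled = [math.log10(v) for v in values]
cleaning_log.append("Kept all values; applied log10 transform")

print(cleaning_log)
```

Writing each decision to a log as it is made keeps the handling of outliers documented and reproducible.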

FAQs

How is data cleansing done?

Data cleansing is done by identifying and correcting errors in data, such as missing values, duplicate values, or outliers.

Why is data cleansing important?

Data cleansing is important because it ensures the accuracy and integrity of the data.

Can data cleansing be automated?

Yes, data cleansing can be automated using various tools and software, such as data quality software, data integration software, and data governance software.

How often should data cleansing be performed?

The frequency of data cleansing will depend on the specific use case and the nature of the data. Some organizations may perform data cleansing on a daily or weekly basis, while others may only need to do so on a monthly or quarterly basis.