Data cleansing, also known as data cleaning or scrubbing, is the process of identifying and then correcting or removing errors, inconsistencies, and inaccuracies in data to improve its quality. The errors can range from simple issues like misspellings to more complicated ones like missing data or mislabelled categories. Careful data cleansing improves the reliability of the ensuing data analysis and, consequently, the validity of insights or decisions derived from it. Despite being time-consuming, it’s a necessary step in research methodology to ensure the data used is as accurate as possible.
Definition: Data cleansing
Data cleansing refers to removing or correcting inaccurate, incomplete, or irrelevant data to improve quality and consistency. Its purpose is to resolve errors in the data to make it accurate, complete, and valid. An error in data can be defined as any deviation from the expected values or patterns. Here is a step-by-step process for data cleansing:
- Data validation
- Data screening
- Diagnosing data entries
- Developing codes
- Transforming/removing data
The importance of data cleansing
Data plays a crucial role in quantitative research, as it is used to make inferences and predictions about a given population. Statistical analyses are used to analyse the data, and hypothesis testing is used to test the validity of research findings. However, if data is not cleansed properly, it can lead to bias in research results, such as information bias or omitted variable bias.
Data cleansing – Distinguish dirty from clean data
Dirty data is data that contains inconsistencies and errors. Three common sources of dirty data include:
- Poor research design
- Data entry errors
- Inconsistent formatting
Dirty Data | Clean Data |
Invalid | Valid |
Inaccurate | Accurate |
Incomplete | Complete |
Inconsistent | Consistent |
Duplicate | Unique |
Falsely Formatted | Uniform |
Valid vs. invalid data
Valid data meets the criteria for data validation, such as being within a specific range. Invalid data doesn’t meet these criteria and may be removed or corrected during the data cleansing.
Accurate vs. inaccurate data
Accurate data correctly reflects the real values it is meant to record, while inaccurate data contains errors such as typos or wrongly measured values.
Complete vs. incomplete data
Complete data is fully recorded and contains no missing values, while incomplete data contains missing values. Incomplete data can be reconstructed using methods such as single imputation or multiple imputation.
Consistent vs. inconsistent data
Consistent data agrees with other data and doesn’t contain any contradictions. In contrast, inconsistent data contains contradictions or discrepancies.
Unique vs. duplicate data
Unique data is distinct and not duplicated, while duplicate data is identical to other data.
Uniform vs. falsely formatted data
Uniform data follows a consistent format and structure, while falsely formatted data deviates from the established format.
Data cleansing – How to do it
Effective data cleansing is crucial for accurate and reliable quantitative research. It’s important to consider the potential hurdles that may occur during data cleansing, like missing values, outliers, or incorrect formatting. To effectively cleanse data, various techniques can be used, such as data validation, data screening, data diagnosis, code development, and data transformation/removal.
Data cleansing workflow
A data cleansing workflow is a structured approach to identifying and correcting errors, inconsistencies, and inaccuracies in data. Documenting a data cleansing workflow helps to ensure consistency and reproducibility of results. The various steps of a data cleansing workflow include:
- Data validation to avoid dirty data: Check data for errors/inconsistencies and remove or correct invalid data.
- Data screening for errors: Identify data inconsistencies, like missing values or outliers.
- Diagnosing data entries: Examine individual data entries to identify and correct errors/inconsistencies.
- Developing codes: Create codes or rules for cleaning and transforming data.
- Transforming or removing data: Cleanse and transform data for more accurate and reliable analysis.
Data cleansing – Validation
Data validation is a technique to ensure that data meets specific criteria before it is stored or processed. This can include checking for errors and inconsistencies and removing or correcting invalid data. Data validation is relevant when collecting data to ensure that it’s accurate and reliable for analysis. There are several types of data validation constraints, including:
Data-type constraints
Ensure that data is of a particular type, such as a number or a string. This can include checking if a phone number or date is entered in the correct format.
Range constraints
Ensure that data falls within a specific range. This can include checking that a participant’s age is between 18 and 65 or their weight is between 50 and 200 pounds.
Mandatory constraints
Ensure that certain data is present before it is stored or processed. This can include checking that a required field is not empty or that a certain number of responses are collected.
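The three constraint types can be combined into a single check per record. The following is a minimal sketch, assuming each record is a Python dictionary; the field names `participant_id`, `age`, and `signup_date`, as well as the `YYYY-MM-DD` date format, are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def validate_record(record):
    """Return a list of constraint violations for one record (empty if valid)."""
    errors = []

    # Mandatory constraint: required fields must be present and non-empty.
    for field in ("participant_id", "age", "signup_date"):
        if record.get(field) in (None, ""):
            errors.append(f"{field}: missing")

    # Data-type constraint: age must be an integer and the date must parse.
    age = record.get("age")
    if age not in (None, "") and not isinstance(age, int):
        errors.append("age: not an integer")
    date = record.get("signup_date")
    if date not in (None, ""):
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except (TypeError, ValueError):
            errors.append("signup_date: not in YYYY-MM-DD format")

    # Range constraint: age must fall between 18 and 65 inclusive.
    if isinstance(age, int) and not 18 <= age <= 65:
        errors.append("age: outside 18-65")

    return errors
```

Returning a list of violations rather than a single true/false value makes it easier to decide afterwards whether a record should be corrected or removed.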
Data cleansing – Screening
Storing a backup copy of the originally collected data is vital for data screening, as it allows the cleaned data to be compared with the original. The process of data screening involves:
Step 1: Structuring the dataset
This involves organizing and formatting data to make it more accurate and reliable for analysis. Important steps to consider when structuring a dataset include:
- Sorting data
- Removing duplicates
- Standardizing formatting
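As a rough illustration of these three steps, the sketch below standardizes formatting first, since duplicates often only become visible once spelling and spacing are unified. It assumes rows are simple `(name, country)` tuples; that layout and the sample values are made up for the example.

```python
def structure_rows(rows):
    """Standardize formatting, drop exact duplicates, and sort the rows."""
    standardized = []
    for name, country in rows:
        # Standardize formatting: collapse whitespace and unify letter case.
        clean_name = " ".join(name.split()).title()
        clean_country = country.strip().upper()
        standardized.append((clean_name, clean_country))

    # Remove duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(standardized))

    # Sort so that similar entries end up next to each other.
    return sorted(unique)


raw = [("  alice smith", "de"), ("Bob  Jones", "US "), ("Alice Smith", "DE")]
structured = structure_rows(raw)
```

Here the first and third rows turn out to be the same entry once formatting is standardized, so only two rows remain.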
Step 2: Scanning data for inconsistencies
The second step in data screening involves identifying any errors or inconsistencies in the data, such as missing values or outliers. Tasks to consider when scanning data for inconsistencies include:
- Looking for missing data
- Identifying outliers
- Checking for patterns in the data
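Looking for missing data can start with a simple count per field. This is a small sketch, assuming records are dictionaries and that `None`, an absent key, or a blank string all count as missing; the sample records are invented for illustration.

```python
def count_missing(records, fields):
    """Count empty or absent values per field across a list of dict records."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            # Treat None, an absent key, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return missing


records = [
    {"id": 1, "age": 34, "city": "Leeds"},
    {"id": 2, "age": None, "city": ""},
    {"id": 3, "city": "York"},
]
missing_counts = count_missing(records, ["id", "age", "city"])
```

A field with a noticeably higher count than the others is a natural place to begin checking for patterns in how the data went missing.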
Step 3: Using statistical methods to explore data
Descriptive statistics are crucial for describing distributions and detecting outliers and skewness in data. These methods include:
- Boxplots, scatterplots, histograms – used to visualize data and identify patterns and outliers.
- Normal distribution – a statistical model against which abnormal data points can be identified.
- Mean, median, mode – can summarize data and identify patterns and outliers.
- Frequency tables – identify the most common values in a dataset as well as rare values that may be outliers.
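The numeric summaries in this list can be computed with Python's standard `statistics` and `collections` modules. The sketch below uses an invented sample of ages; the function name `describe` is only illustrative.

```python
import statistics
from collections import Counter


def describe(values):
    """Summarize a numeric sample with central tendency and spread."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
    }


ages = [23, 25, 25, 27, 29, 31, 95]
summary = describe(ages)

# Frequency table: (value, count) pairs, most common value first.
frequency_table = Counter(ages).most_common()
```

In this sample the mean lies well above the median, which hints at a skewed distribution or an outlier (here the value 95) that deserves a closer look.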
Data cleansing – Diagnosing
Diagnosing data is the process of assessing the data quality in a dataset. This step is crucial for understanding potential issues that may arise when working with the data, such as inaccuracies, inconsistencies, and missing values. If data isn’t properly diagnosed, it can lead to inaccurate conclusions and poor decision-making. Some common problems in dirty data include:
- Duplicate data: Data that appears multiple times in a dataset
- Invalid data: Data that doesn’t conform to the expected format or values
- Missing values: Data that is missing in particular fields or observations
- Outliers: Data that is significantly different from the majority of the data in the dataset
Removing duplicate data
Deduplication is the process of identifying and removing duplicate data from a dataset.
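A minimal sketch of deduplication, assuming records are dictionaries and that duplicates are defined by a chosen set of identifying fields (the `email` field and the sample responses are hypothetical):

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        # Normalize the key so trivial case or whitespace differences still match.
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


responses = [
    {"email": "ana@example.com", "score": 7},
    {"email": "Ana@Example.com ", "score": 7},
    {"email": "ben@example.com", "score": 5},
]
deduplicated = deduplicate(responses, ["email"])
```

Keeping the first occurrence is only one possible policy; depending on the study, the most recent or most complete record may be the better one to keep.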
Invalid data
Data standardization ensures that data conforms to a specific format or set of rules. This method helps ensure consistency and accuracy in the data.
Strict string-matching and fuzzy string-matching are methods used to identify and correct invalid data. Strict string-matching compares data precisely as entered, while fuzzy string-matching allows for slight variations in the data.
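The difference between the two approaches can be sketched with Python's standard `difflib` module. The list of valid categories and the similarity cutoff of 0.8 are assumptions made for this example, not recommended values.

```python
import difflib

# Hypothetical list of valid category labels.
VALID_COUNTRIES = ["Germany", "France", "Spain"]


def strict_match(value):
    """Accept the value only if it exactly equals a valid category."""
    return value if value in VALID_COUNTRIES else None


def fuzzy_match(value, cutoff=0.8):
    """Return the closest valid category if it is similar enough, else None."""
    matches = difflib.get_close_matches(value, VALID_COUNTRIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With these definitions, the misspelling `"Germny"` is rejected by `strict_match` but mapped to `"Germany"` by `fuzzy_match`, while an unrelated entry such as `"Italy"` is rejected by both. A lower cutoff catches more typos but also raises the risk of wrong corrections.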
Data cleansing – Missing data
Missing data can be random, meaning the values are missing entirely by chance, or non-random, meaning the missingness is related to characteristics of the data itself. Missing data can be tackled by:
Accepting: | Leaving the missing data as is and treating it as a separate category |
Removing: | Deleting observations or fields with missing data |
Recreating: | Using statistical methods to estimate missing data |
Recreating missing data relies on imputation, which replaces missing values with estimated ones. To use imputation properly, it’s important to understand the underlying causes of the missing data and to use appropriate statistical methods to estimate the missing values.
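As an illustration of removing versus recreating, the sketch below uses mean imputation, which is one of the simplest imputation methods and is swapped in here purely as an example. It assumes missing values are represented as `None`; the sample weights are invented.

```python
import statistics


def remove_missing(values):
    """Removing: drop observations whose value is missing (None)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Recreating: replace each missing value with the mean of observed values."""
    observed = remove_missing(values)
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


weights = [60, None, 80, 70, None]
complete_cases = remove_missing(weights)
imputed = impute_mean(weights)
```

Mean imputation keeps the sample size intact but makes the data look less variable than it really is, so it is usually only defensible when values are missing at random and the share of missing data is small.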
Data cleansing – Outliers
Outliers in a dataset are values significantly different from most data. Outliers can be either true values or errors.
True Outliers | Error Outliers |
Genuine values that are unusual or unexpected | Values that are the result of errors or mistakes in data collection or entry |
Identifying outliers
Common methods to detect outliers in a dataset include:
- Using statistical tests such as Z-scores or the interquartile range
- Using visualization methods like box plots or scatter plots
- Comparing data to expected values or ranges.
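The two statistical tests in this list can be sketched with the standard `statistics` module. The thresholds used here (a Z-score above 3 and 1.5 times the interquartile range) are common conventions rather than fixed rules, and the sample data is invented.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # All values are identical, so nothing can be an outlier.
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


ages = [23, 25, 25, 27, 29, 31, 95]
```

For this sample, `iqr_outliers(ages)` flags 95, whereas `zscore_outliers(ages)` with the default threshold flags nothing: in a sample of only seven values no Z-score can reach 3, so a lower threshold such as 2 is needed. This is one reason to combine statistical tests with visual checks.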
Retaining or removing outliers
Once outliers are identified in a dataset, there are several ways to handle them. One option is to remove them; another is to keep them but scale them differently. Sometimes, it may be best to keep the outliers unchanged and use them to inform the analysis. In every case, it is important to document any outliers found and the decision made about handling them.
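The handling options can be sketched as small helper functions. How to "scale differently" is not specified in the text, so clipping values to fixed bounds and applying a logarithmic transformation are shown here as two assumed interpretations; the bounds themselves are hypothetical and would come from the outlier detection step.

```python
import math


def remove_outside(values, lower, upper):
    """Remove: drop values outside the accepted bounds."""
    return [v for v in values if lower <= v <= upper]


def clip_to_bounds(values, lower, upper):
    """Rescale (assumed option 1): pull extreme values back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def log_scale(values):
    """Rescale (assumed option 2): compress large values with a base-10 logarithm.

    Only valid for strictly positive values.
    """
    return [math.log10(v) for v in values]
```

For example, with bounds of 16 and 40, the values `[23, 25, 95]` become `[23, 25]` after removal and `[23, 25, 40]` after clipping. Whichever option is chosen, recording the bounds and the method alongside the results keeps the decision traceable.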
FAQs
Data cleansing is done by identifying and correcting errors in data, such as missing values, duplicate values, or outliers.
Data cleansing is important because it ensures the accuracy and integrity of the data.
Data cleansing can be automated using various tools and software, such as data quality software, data integration software, and data governance software.
The frequency of data cleansing will depend on the specific use case and the nature of the data. Some organizations may perform data cleansing on a daily or weekly basis, while others may only need to do so on a monthly or quarterly basis.