
Best Practices for SQL Data Cleaning

SQL datasets often require cleaning. Whenever we start working on a dataset, the first thing to ensure is that the data is clean. But what does it mean for data to be "not clean", and why is cleaning it so important? Let's understand it with the help of an example.

To begin, imagine a DVD rental store. A customer named Bob (customer ID 302) rents a DVD of the film 'Forrest Gump' (film ID 104) from sales representative Charlie (employee ID 204) for $2 for 6 days. The purchase ID is 904.

What if this record appears multiple times in your database? Even worse, what if the customer's name differs between rows, and in some rows Charlie is misspelled as "Barlie"? Any analysis built on this table becomes unreliable, and debugging it would be frustrating, right?

Hence, cleaning data is always the first step before analyzing it. No matter how good your code and visualization techniques are, you'll most likely get misleading results if the data is not clean. Learning how to clean your data using SQL is an essential skill.

In this guide, we’ll walk through the best practices for SQL data cleaning, covering common issues and practical solutions.

Why Is Data Cleaning Important?

In the real world, raw data almost always contains problems: errors, duplicates, inconsistencies and missing values. Cleaning the dataset brings the following benefits –

  1. Reports and analysis become easier to produce, and their accuracy improves.
  2. Datasets stay consistent across tables.
  3. Others can use the data without having to reach out to you for clarification.
  4. Decision-making improves, and future errors and inefficiencies are prevented.

Don't worry: SQL (Structured Query Language) provides excellent tools to identify and fix these issues.

Common Data Quality Issues

Before we dive into SQL techniques, let's understand some common data problems:

  1. Duplicate Records – the same record appears multiple times in a table.
  2. Missing Values – NULL or empty values in important fields.
  3. Inconsistent Data Formats – for example, dates stored sometimes as DD/MM/YYYY and sometimes as MM/DD/YYYY.
  4. Incorrect Data – typos, outdated information or confusing entries. Remember the Charlie/Barlie example?
  5. Irrelevant Data – unnecessary records that serve no purpose and clutter analysis.
  6. Data Type Mismatches – values stored in the wrong type or format, causing errors or inconsistencies. For example, a "$" prefix in some currency values but not in others.

Now, let’s see how SQL can help address these issues.

1. Removing Duplicate Records – 

As we know, duplicate records can ruin an analysis by skewing the data. Imagine the largest value in the dataset appearing multiple times and how that would distort an average or a total. Fortunately, we can identify and remove duplicates using SQL's DISTINCT, GROUP BY, or ROW_NUMBER().
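For a quick de-duplicated view of the data, without modifying the table, DISTINCT is the simplest of these tools:

```sql
-- Return each name/email combination only once
SELECT DISTINCT name, email
FROM customers;
```

This is read-only, so it is a safe first step before any DELETE.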

a. Finding Duplicates –

The query below finds customers who appear more than once –

SELECT name, email, COUNT(*)
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;

b. Deleting Duplicates –

This deletes duplicate rows while keeping the first entry (lowest id) for each name/email pair.

WITH cte AS (
  SELECT id,
         ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
  FROM customers
)
DELETE FROM customers
WHERE id IN (SELECT id FROM cte WHERE rn > 1);

2. Handling Missing Values –

As we discussed earlier, missing data can result in incomplete analysis. We can handle it in the following ways –
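Before filling anything in, it helps to measure how much is missing. COUNT(column) ignores NULLs while COUNT(*) does not, so (using the employees table from the examples below) we can count missing salaries like this:

```sql
-- COUNT(salary) skips NULLs, so the difference is the number of missing values
SELECT COUNT(*) AS total_rows,
       COUNT(salary) AS rows_with_salary,
       COUNT(*) - COUNT(salary) AS missing_salaries
FROM employees;
```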

a. Replacing NULLs with a Default Value –

Let's see how we can assign a default salary of 30,000 to employees with missing salaries –

UPDATE employees
SET salary = 30000
WHERE salary IS NULL;

b. Using COALESCE to Fill Missing Data –

We can replace NULL email addresses with a placeholder using COALESCE –

SELECT name, COALESCE(email, 'not_provided@example.com') AS email
FROM customers;

c. Using Aggregate Functions for Missing Data –

We can replace missing salaries with the average of the salaries that are available –

UPDATE employees
SET salary = (SELECT AVG(salary) FROM employees WHERE salary IS NOT NULL)
WHERE salary IS NULL;

3. Standardizing Data Formats –

When we combine data from multiple sources, inconsistent data formats are a common problem. Here are some ways to standardize them using SQL –

a. Fixing Case Inconsistencies –

We can standardize a column to proper case using INITCAP (supported in PostgreSQL and Oracle; other databases need their own string functions). For example, "John Doe" instead of "john DOE".

UPDATE customers
SET name = INITCAP(name);
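Where INITCAP is not available, a simpler and widely portable normalization is forcing a column to a single case – for example, lowercasing emails so that 'Bob@Example.com' and 'bob@example.com' compare as equal:

```sql
-- LOWER() is standard SQL and works across virtually all databases
UPDATE customers
SET email = LOWER(email);
```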

b. Standardizing Date Formats –

This is a very common problem in both SQL and Excel. For 3 Feb 2025, some people write 03/02/2025, others 02/03/2025, or even other formats like 03.02.2025. This is confusing for everyone using the data, so we standardize it with the following query –

UPDATE orders
-- Assumes order_date is currently stored as text in YYYY-MM-DD form
SET order_date = TO_DATE(order_date, 'YYYY-MM-DD');

c. Ensuring Consistent Phone Number Formatting –

Now imagine coming across phone numbers written in different formats, such as +9174653… or 74653…. We can normalize them by removing all non-numeric characters –

UPDATE customers
SET phone_number = REGEXP_REPLACE(phone_number, '[^0-9]', '', 'g');

4. Correcting Incorrect Data –

Quite often, incorrect data entries such as typos or outdated values cause confusion. SQL can help identify and fix such issues.

a. Finding Mismatched Data –

The following query finds department names that are incorrect or contain typos, since the only valid departments are HR, Sales, Marketing and IT –

SELECT * FROM employees
WHERE department NOT IN ('HR', 'Sales', 'Marketing', 'IT');

b. Updating Incorrect Values –

If the same typo occurs many times, we can fix it in a single statement –

UPDATE employees
SET department = 'Marketing'
WHERE department = 'Markting';

5. Removing Irrelevant Data for SQL Data Cleaning –

It is good practice to remove unnecessary data from time to time. Let's see how we can do that using SQL –

a. Deleting Outdated Data –

There is little point in keeping outdated records – say, orders older than 2020 – so we can delete them like this –

DELETE FROM orders
WHERE order_date < '2020-01-01';
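If the old rows might still be needed for audits, one option is to copy them into an archive table before deleting. (The orders_archive table here is hypothetical and assumed to have the same columns as orders.)

```sql
-- orders_archive is an assumed table with the same schema as orders
INSERT INTO orders_archive
SELECT * FROM orders
WHERE order_date < '2020-01-01';

DELETE FROM orders
WHERE order_date < '2020-01-01';
```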

b. Removing Unwanted Columns –

We can drop a column entirely if it is no longer required.

ALTER TABLE customers
DROP COLUMN old_phone_number;

6. Using Constraints to Prevent Future Issues –

While we fix incorrect or outdated data and typos, it is equally important to prevent future issues from arising. SQL constraints help maintain data integrity in the following ways –

a. Enforcing Unique Values –

If someone tries to enter the same email for two customers, we can block it as follows –

ALTER TABLE customers
ADD CONSTRAINT unique_email UNIQUE (email);

b. Setting Default Values –

Furthermore, this is how we can set a default salary of 30,000 for new employees –

ALTER TABLE employees
ALTER COLUMN salary SET DEFAULT 30000;

c. Restricting Invalid Entries –

Obviously, a negative salary isn't possible, right? Hence we can restrict it with a CHECK constraint –

ALTER TABLE employees
ADD CONSTRAINT check_salary CHECK (salary > 0);

7. Automating SQL Data Cleaning with Stored Procedures –

Cleaning data is not a one-time activity, so it is worth automating recurring tasks using stored procedures.

For example: removing duplicates automatically –

CREATE PROCEDURE remove_duplicates()
LANGUAGE plpgsql
AS $$
BEGIN
  DELETE FROM customers
  WHERE id IN (
    SELECT id FROM (
      SELECT id, ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
      FROM customers
    ) temp
    WHERE rn > 1
  );
END;
$$;

Going forward, we can schedule this procedure to run periodically, thus keeping our data clean automatically.
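One way to do the scheduling, assuming a PostgreSQL server with the pg_cron extension installed (an add-on, not part of core PostgreSQL), is:

```sql
-- Run the cleanup procedure every night at 2 AM via pg_cron
SELECT cron.schedule('nightly-dedup', '0 2 * * *', 'CALL remove_duplicates()');
```

Other databases have their own schedulers, e.g. SQL Server Agent jobs or MySQL's EVENT scheduler.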

Final Thoughts on SQL Data Cleaning

Evidently, cleaning data in SQL is an ongoing process. By applying these best practices, you can maintain a clean, consistent, and reliable database. Whether you’re removing duplicates, handling missing values, or standardizing formats, SQL provides powerful tools to keep your data in top shape.

Additionally, incorporating data validation checks, constraints, and automation ensures long-term data integrity and prevents recurring issues. Start incorporating these SQL cleaning techniques in your workflow, and you’ll see a significant improvement in your data quality!

All the best!

Please also explore our guide on SQL Query Optimization for faster performance.

You can also check out advanced SQL data cleaning techniques – Advance Data Cleaning in SQL

Drop your queries in the comments and we will answer.
