3 Steps to Clean Data in SQL Server

I am obsessed with SQL Server now mainly because of its power. This time, I will show 3 steps to clean data, also involving some tips for easy use.

The demo dataset I downloaded is covid-19 deaths, considering the end of the pandemic, it is time to reflect on the three years.

You can find the original dataset from Our World in Data: Covid-19 Deaths.

You can find the final dashboard on my Power BI portfolio: Covid Deaths R.I.P.

0. Preparation Work

Create new database → Download data → Import (SQL Server Import and Export Wizard) → Destination: SQL Server Native Client 11.0/Server name/Database

<div style="text-align: center"></div>

1.Delete redundant rows and columns

In the dataset, the data has four columns: continent, location, date, total_deaths

There are 2 issues needed to fix:

  1. the date started too early and there is a period of time when the pandemic has not begun

  2. the location includes bigger areas, like continents and income segmentation

  3. the continent is not needed in the final analysis, but it is needed when deleted unuseful locations

<div style="text-align: center"></div>

So the code is

-- Step1 row: Delete rows where total_deaths or continent is null
DELETE FROM [PortfolioProject].[dbo].[CovidDeathsCountry]
WHERE total_deaths IS NULL OR continent IS NULL;
-- Step1 column: Delete columns that are not used
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry]
DROP COLUMN continent;

Then I get the data that I need

<div style="text-align: center"></div>

2. Be careful about the datatype

Although SQL Server identifies datatypes automatically, it is necessary to check each datatype, in this case, the data type is not correct: total_deaths is varchar (255).

So I replaced the datatype with BIGINT:

-- Step2: Alter datatype
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry]
ALTER COLUMN total_deaths BIGINT;

3. Aggregate data

Because I need a total number globally, I need to add the total deaths on the same day.

Firstly, I need to add a new column named: world_deaths

-- Step3: Add columns world_deaths
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry]
ADD world_deaths BIGINT;

Then I used a subquery to calculate the sum then assign the value to the corresponding cells.

UPDATE [PortfolioProject].[dbo].[CovidDeathsCountry]
SET world_deaths = subquery.sum_deaths
FROM (
       SELECT date,
       SUM(total_deaths) AS sum_deaths
       FROM [PortfolioProject].[dbo].[CovidDeathsCountry]
       GROUP BY date) AS subquery
WHERE [PortfolioProject].[dbo].[CovidDeathsCountry].date = subquery.date;

I got the table that I could use in Power BI now, right-click then Save Results AS

<div style="text-align: center"></div>

Final Tips

Shortcuts

I watched a video talking about easy-to-use tips, thanks to Ike Ellis.

I only listed 2 tips I used a lot this time.

The first one is that Query Shortcuts (Tools/Options/Environment/Keyboard/Query Shortcuts)

Because I often checked if the data was cleaned correctly, I used Ctrl+F3 to count the rows, and Ctrl+F4 to select the top 1000 rows.

<div style="text-align: center"></div>

Then I could choose the table and use the shortcuts to check the current stage quickly.

Customised Colour

I often change themes or colours to create a comfortable environment. This time I changed the location and colour of the status bar.

The setting is located in Tools/Options/Text Editor/Editor Tab and Status Bar. I set the location as the top, and the colour is pink.

Now I can be happy to play around with SQL Server.