I am obsessed with SQL Server now mainly because of its power. This time, I will show 3 steps to clean data, also involving some tips for easy use.
The demo dataset I downloaded is covid-19 deaths, considering the end of the pandemic, it is time to reflect on the three years.
You can find the original dataset from Our World in Data: Covid-19 Deaths.
You can find the final dashboard on my Power BI portfolio: Covid Deaths R.I.P.
You can find the complete SQL code on my GitHub.
0. Preparation Work
Create new database → Download data → Import (SQL Server Import and Export Wizard) → Destination: SQL Server Native Client 11.0/Server name/Database
1.Delete redundant rows and columns
In the dataset, the data has four columns: continent, location, date, total_deaths
There are 2 issues needed to fix:
the date started too early and there is a period of time when the pandemic has not begun
the location includes bigger areas, like continents and income segmentation
the continent is not needed in the final analysis, but it is needed when deleted unuseful locations
So the code is
-- Step1 row: Delete rows where total_deaths or continent is null
DELETE FROM [PortfolioProject].[dbo].[CovidDeathsCountry] WHERE total_deaths IS NULL OR continent IS NULL;
-- Step1 column: Delete columns that are not used
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry] DROP COLUMN continent;
Then I get the data that I need
2. Be careful about the datatype
Although SQL Server identifies datatypes automatically, it is necessary to check each datatype, in this case, the data type is not correct: total_deaths is varchar (255).
So I replaced the datatype with BIGINT:
-- Step2: Alter datatype
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry] ALTER COLUMN total_deaths BIGINT;
3. Aggregate data
Because I need a total number globally, I need to add the total deaths on the same day.
Firstly, I need to add a new column named: world_deaths
-- Step3: Add columns world_deaths
ALTER TABLE [PortfolioProject].[dbo].[CovidDeathsCountry] ADD world_deaths BIGINT;
Then I used a subquery to calculate the sum then assign the value to the corresponding cells.
UPDATE [PortfolioProject].[dbo].[CovidDeathsCountry] SET world_deaths = subquery.sum_deaths FROM ( SELECT date, SUM(total_deaths) AS sum_deaths FROM [PortfolioProject].[dbo].[CovidDeathsCountry] GROUP BY date) AS subquery WHERE [PortfolioProject].[dbo].[CovidDeathsCountry].date = subquery.date;
I got the table that I could use in Power BI now, right-click then Save Results AS
I watched a video talking about easy-to-use tips, thanks to Ike Ellis.
I only listed 2 tips I used a lot this time.
The first one is that Query Shortcuts (Tools/Options/Environment/Keyboard/Query Shortcuts)
Because I often checked if the data was cleaned correctly, I used Ctrl+F3 to count the rows, and Ctrl+F4 to select the top 1000 rows.
Then I could choose the table and use the shortcuts to check the current stage quickly.
I often change themes or colours to create a comfortable environment. This time I changed the location and colour of the status bar.
The setting is located in Tools/Options/Text Editor/Editor Tab and Status Bar. I set the location as the top, and the colour is pink.
Now I can be happy to play around with SQL Server.