Data Platform Design for Healthcare Research
Table of Content
Chapter 1 - Project Overview
Chapter 2 - Design of Schema
Chapter 3 - Data Privacy and Security
Chapter 4 - Extract, Transform, and Load (ETL)
Chapter 5 - Monitor and Maintain
Chapter 6 - Future Work
Chapter 1 Project Overview
The project is inspired by a data engineer job description posted by Our Future Health, to identify how diseases begin and progress in people from different backgrounds. They would like to harness genetic and clinical data to create a more detailed picture of people’s health.
MySQL is a relational database which stores structured data (each row represents one record with different features), it can create several tables with relationships that connect them together, and the purpose of storing data is to reduce duplicate data and retrieve data conveniently.
There are several objectives:
Integration of genetics and healthcare data
Easily monitor and maintain in the future work
Predicted outcome: research-ready, well-curated and well-documented data
Potential audience:
Researchers (bioinformatics scientists, surgeons, oncologists, etc)
Managers in NHS
Public
Funding Organisations
Government Health Sectors
Potential stakeholders
IT (database establishment and maintenance)
Public (disease trend)
Researchers (needed data)
Chapter 2 Design of Schema
Step 1: Initial mindmap
Tables in the schema are divided into two parts:
Clinical data
Genetic data
As the mindmap shows, clinical data comes from NHS and genetic data comes from genetic detection organisations. The clinical data involves patient general information, and admission and treatment records. The genetic data involves detailed mutated genes and frequency and other parameters.
Step 2: Use drawSQL to map it
8 Tables for clinical data:
Patient
Primary care
City
Country
Admission record
Treatment
Medication
Surgery
4 Tables for genetic data:
Sample
Detection Organisation
Genetic Info
Gene
Step 3: Import SQL into MySQL
I use drawSQL to generate SQL queries, and then use MySQL to run them.
Chapter 3 Data Privacy and Security
The preparation of the database has finished, before populating the database, it is essential to talk about ethical considerations as the data is connected to patient privacy. Patients may suffer potential harm when their data is used for research purposes like revealing sensitive information. NHS currently has information governance which protects patients. "Secondary use" must only use data that will not identify individuals.
The pre-requisite includes:
Obtain the appropriate consent from patients
de-identify the data to protect patient privacy
ensure data is only used for approved purposes
Relevant technology includes:
Data anonymisation
Data encryption
Data access control
Data masking
so on
Chapter 4 Extract, Transform, and Load (ETL)
The data comes from two sources: Electronic Health Records (EHRs) and genetic detection organisations. There are several steps to do it:
obtain permission from administrators and patients
establish the relationships between EHR and the database (needed data)
extract data from EHRs (APIs, SQL)
transform: revise the message formats and map coding systems (Apache Spark)
load: import data and ensure the data is complete and accurate
Chapter 5 Monitor and Maintain
When data is generated with time and simulated in the database, the ETL process should be implemented periodically. Set up scheduled ingestion of data from the EHR databases to an analytical database (Airflow)
at the correct time
with a specific interval
in the right order
Chapter 6 Future Work
Coming back to the original purposes, the database is for healthcare researchers, considering the security and management of the database, maybe it is a good idea to set up a separate database for data analysis or views in MySQL for researchers.
Last but not least, the update for the database is an iterated process. Getting feedback from researchers or NHS managers, and adding or deleting columns will be followed in the future.
The slides I presented here. Feel free to comment on it.