Analysing Thalassemia Blood Test Results with Power BI

Download Power BI file from GitHub.

1. Project Overview

Thalassemia is a group of inherited blood disorders that affect the body’s ability to produce hemoglobin, a protein in red blood cells that carries oxygen. Individuals with Thalassemia may experience symptoms like anemia, fatigue, and, in severe cases, organ damage.

This Power BI project explores a publicly available Alpha Thalassemia Dataset, aiming to create an interactive dashboard that enables users to analyse key blood test parameters and identify differences between Thalassemia carriers from individuals with a normal blood profile for the purposes of education.

This project will demonstrate data cleaning, transformation, modelling and visualisation techniques in Power BI using Power Query, DAX, and interactive dashboards.

2. Data Source Context and Comparative Overview

The dataset used in this project was sourced from an online Kaggle repository titled Alpha Thalassemia Dataset - Carriers VS Normal. It comprises of data from 288 cases collected by the Human Genetics Unit (HGU) of the Faculty of Medicine, Colombo, Sri Lanka. The subjects include alpha thalassemia carrying children and their family members screened over a period of 2016 to 2020.

alphanorm

This dataset is designed for broad screening for a group of individuals to distinguish between normal subjects and those suspected of being alpha thalassemia carriers. It provides a baseline of laboratory results for identifying overall patterns and differences between healthy individuals and potential carriers, helping to set thresholds and identify key biomarkers in routine blood tests.

twoalphas

This dataset dives deeper into the carrier demographic and provides a more nuanced analysis of the subtypes of carriers. While it shares the same laboratory tests as the alphanorm dataset, it aims to differentiate between the distinct forms of alpha-thalassemia carriers i.e. those with alpha thalassemia trait and those that are silent carriers. This specific distuinghing is valuable for tailoring diagnostic approaches and understanding clinical presentation among carriers.

3. Data Transformation

The transformation process for the datasets involved several key steps in Power Query:

  • Ensuring proper data types for each column

  • Handling missing values

  • Standardising column names with units

  • Creating new calculated columns

This process was applied to two datasets: TwoAlphas and AlphaNorm

Handling Missing Values

To identify missing values, I clicked the dropdown arrow next to each column name and searched through the unique values selector to identify any blank or null values.

  • alphanorm had two null value in the MCH column and a null value in the RBC column

  • twoalphas had a null value in the MCH column

Null values were replaced with the mean value of the column, the mean value was chosen as the distibution of the values in those columns were evenly varied.

In twoalphas, the following M code was used to replace the null value with the mean of the column:

MeanMCH = List.Average(#"Changed Type"[mch]),
ReplaceNulls = Table.ReplaceValue(#"Changed Type", null, MeanMCH, Replacer.ReplaceValue, {"mch"})

In alphanorm, as there were null values in two columns, a custom function was used to reduce the use of redundant code:

let
    // Calculate mean, ignoring null values
    MeanValue = List.Average(List.RemoveNulls(Table.Column(inputTable, columnName))),
    // Replace nulls in the specified column with the mean
    ReplacedTable = Table.TransformColumns(inputTable, {{columnName, each if _ = null then MeanValue else _, type number}})
in
    ReplacedTable,
ReplaceNullRBC = ReplaceNullsWithMean(#"Changed Type", "rbc"),
ReplaceNullMCH = ReplaceNullsWithMean(ReplaceNullRBC, "mch")

Creating Calculated Columns

Two calculated columns were added to both tables for enhanced data analysis.

Haemoglobin Category - labels the subject as low if the Hb value is less than 10 g/dL or else as normal

#"Added Custom" = Table.AddColumn(#"Renamed Columns", "Haemoglobin Category",
    each if [#"Hb (g/dL)"] < 10 then "Low" else "Normal")

RBC to Hb Ratio - ratio between the RBC value and the Hb value (RBC / Hb)

#"Added Custom1" = Table.AddColumn(#"Added Custom", "RBC/Hb Ratio",
    each [#"RBC (10^12/L)"] / [#"Hb (g/dL)"])

4. Data Model

The two independant tables contain descriptive attributes like phenotype and sex. The data is anonymised and there are no shared identifiers, therefore it’s not possible to establish any relationship between the two. There is also no need to implement any schema structure as the two independant tables are sufficient.

Summarised Averages per Phenotype using DAX

alphanorm_summary = SUMMARIZE('alphanorm (1)', 'alphanorm (1)'[Phenotype],
    "Avg_Hb", AVERAGE('alphanorm (1)'[Hb (g/dL)]),
    "Avg_RBC", AVERAGE('alphanorm (1)'[RBC (10^12/L)]),
    "%_Low_Hb", DIVIDE(CALCULATE(COUNTROWS('alphanorm (1)'), 'alphanorm (1)'[Haemoglobin Category] = "Low"), COUNTROWS('alphanorm (1)')))

twoalphas_summary = SUMMARIZE('twoalphas (1)', 'twoalphas (1)'[Phenotype],
    "Avg_Hb", AVERAGE('twoalphas (1)'[Hb (g/dL)]),
    "Avg_RBC", AVERAGE('twoalphas (1)'[RBC (10^12/L)]),
    "%_Low_Hb", DIVIDE(CALCULATE(COUNTROWS('twoalphas (1)'), 'twoalphas (1)'[Hb Category] = "Low"), COUNTROWS('twoalphas (1)')))

0
Subscribe to my newsletter

Read articles from Ahamad Tawsif Chowdhury directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Ahamad Tawsif Chowdhury
Ahamad Tawsif Chowdhury