Session 1
Data and data preprocessing
This session reintroduces the concept of data. We will look at different data types, the important aspects of data that inform the health analytics process, and how we can present our data to our target audience.
Evidence of Learning

Applying Pandas to a Dataset
01. Installing and Loading Pandas
pip install pandas
import pandas as pd
df = pd.read_csv('dataset.csv') # Replace 'dataset.csv' with your file path
Evaluating Data Quality
Data quality is a cornerstone of effective analysis and decision-making. If you're working with Python, tools like YData-Profiling and Great Expectations can make evaluating and maintaining data quality a seamless process.
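Before reaching for dedicated libraries, plain Pandas can surface the most common quality problems. A minimal sketch, reusing the df DataFrame loaded in the step above:
df.info()                     # Column types and non-null counts
print(df.isna().sum())        # Missing values per column
print(df.duplicated().sum())  # Count of fully duplicated rows
print(df.describe())          # Summary statistics for numeric columns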
YData-Profiling: Automated Exploratory Data Analysis
What is YData-Profiling?
YData-Profiling (formerly Pandas-Profiling) is an excellent tool for automatic exploratory data analysis (EDA), profiling, and even comparing datasets. It can generate detailed insights into your dataset’s structure, distributions, and interactions.
Installation and Setup
Install the package using pip:
pip install ydata_profiling
Then, import the necessary functions in your Python environment:
from ydata_profiling import ProfileReport, compare
import pandas as pd
Creating a Profile Report
To generate a profile report, load your data into a DataFrame and call the profiling functions:
data = pd.read_csv("heart.csv")
report = ProfileReport(data, title="Heart Disease Data Profile")
report.to_file("profile_report.html")
Alternatively, view the report directly in a Jupyter Notebook:
report.to_notebook_iframe()
The resulting report includes:
- Overview: Key dataset statistics, variable distributions, and missing values.
- Alerts: Warnings on potential data quality issues, such as duplicates or constant columns.
- Reproducibility Information: Details to replicate the analysis.
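The report can also be consumed programmatically. As a minimal sketch, the profiling output can be exported to JSON for downstream checks:
json_data = report.to_json()  # Full profiling results as a JSON string
report.to_file("profile_report.json")  # to_file infers the format from the extension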
Handling Large Datasets
For extensive datasets, set the minimal argument to True; minimal mode disables the most expensive computations (such as correlations and interactions) to streamline profiling:
report = ProfileReport(data, title="Minimal Profile", minimal=True)
You can also specify certain variables for detailed interactions to avoid overloading computations:
report.config.interactions.targets = ["RestingBP", "Cholesterol"]
report.df = data
report.to_file("focused_profile.html")
Comparing Datasets
Compare multiple datasets or versions using the compare method on a report, or the top-level compare function imported earlier:
# train_df and test_df are assumed to be previously loaded DataFrames
train_report = ProfileReport(train_df, title="Training Data")
test_report = ProfileReport(test_df, title="Testing Data")
comparison = train_report.compare(test_report)
comparison.to_file("comparison.html")
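The top-level compare function is an equivalent interface and also accepts a list of more than two reports:
comparison = compare([train_report, test_report])
comparison.to_file("comparison.html")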
YData-Profiling is particularly useful when working with sensitive data, as it operates locally without sending data to external services.
Great Expectations: Automating Data Validation
What is Great Expectations?
Great Expectations is a robust framework for data validation, offering tools to define and test assumptions about your data. It’s ideal for setting up automated quality checks in data pipelines.
Installation and Setup
Install the package using pip:
pip install great_expectations
Import the library and load your dataset:
import great_expectations as gx
data = gx.read_csv("heart.csv")
You can also convert a Pandas DataFrame into a Great Expectations object:
import pandas as pd
data = pd.read_csv("heart.csv")
data = gx.from_pandas(data)
Defining Expectations
Expectations are declarative statements about your data. For example:
- Ensuring columns match a specific set:
data.expect_table_columns_to_match_set(
    column_set=['Age', 'Sex', 'Cholesterol'],
    exact_match=True
)
- Checking for unique values in a column (indexing the result with "success" returns a boolean):
data.expect_column_values_to_be_unique(column="id")["success"]
- Verifying data completeness with thresholds (here, at least 80% of Age values must be non-null):
data.expect_column_values_to_not_be_null("Age", mostly=0.80)["success"]
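Expectations can also constrain value ranges. As a sketch (the bounds below are illustrative, not clinical reference values):
data.expect_column_values_to_be_between(
    column="Age", min_value=0, max_value=120
)["success"]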
Visualising Results
Great Expectations provides a web-based interface, Data Docs, for reviewing validation outcomes. It highlights passed and failed expectations with detailed explanations.
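With the classic API used above, the dataset object also exposes a validate method that re-runs every expectation recorded so far in one pass; a minimal sketch:
results = data.validate()  # Runs all expectations defined on this dataset
print(results.success)     # Overall pass/fail across all expectations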